Search Results for author: Lizhen Xu

Found 2 papers, 2 papers with code

Multiple-Choice Questions are Efficient and Robust LLM Evaluators

1 code implementation • 20 May 2024 • Ziyin Zhang, Lizhen Xu, Zhaokun Jiang, Hongkun Hao, Rui Wang

We present GSM-MC and MATH-MC, two multiple-choice (MC) datasets constructed by collecting answers and incorrect predictions on GSM8K and MATH from over 50 open-source models.

GSM8K Math +1

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

1 code implementation • 18 Jan 2024 • Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu

We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records.

Benchmarking
