Search Results for author: Lizhen Xu

Found 2 papers, 2 papers with code

Multiple-Choice Questions are Efficient and Robust LLM Evaluators

1 code implementation20 May 2024 Ziyin Zhang, Lizhen Xu, Zhaokun Jiang, Hongkun Hao, Rui Wang

We present GSM-MC and MATH-MC, two multiple-choice (MC) datasets constructed by collecting answers and incorrect predictions on GSM8K and MATH from over 50 open-source models.

GSM8K Math +1

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

1 code implementation18 Jan 2024 Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu

We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records.

Benchmarking

Cannot find the paper you are looking for? You can Submit a new open access paper.