Search Results for author: Sam Miller

Found 1 paper, 1 paper with code

WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

1 code implementation • 1 May 2024 • Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, Bertie Vidgen

We evaluate five existing ReAct agents on WorkBench and find that they successfully complete as few as 3% of tasks (Llama2-70B), and only 43% even for the best-performing agent (GPT-4).

Scheduling

