WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting
Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, Bertie Vidgen
TL;DR
WorkBench addresses a gap in evaluating autonomous workplace agents by introducing an outcome-centric benchmark with 690 tasks across five domains in a sandbox environment. Agents are given 26 tools to perform multi-step tasks, and each task has a unique ground-truth outcome, which enables robust automatic evaluation. The study shows that the best-performing agent (GPT-4) completes only about 43% of tasks (49% with resampling), exposing weaknesses in planning, tool selection, and handling tool constraints that matter in high-stakes settings. The dataset is open-source and designed to scale with additional domains and tasks, supporting more realistic and rigorous evaluation of workplace AI agents.
Abstract
We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing agent (GPT-4). We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings. WorkBench is publicly available as a free resource at https://github.com/olly-styles/WorkBench.
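To make the idea of outcome-centric evaluation concrete, the following is a minimal Python sketch, not the WorkBench implementation. It assumes the sandbox state can be modeled as a dictionary of databases, and it judges an agent by the final database state rather than by the exact sequence of tool calls. The names `send_email`, `run_actions`, `outcome_matches`, and `has_side_effect`, along with the example addresses, are hypothetical and chosen only for illustration.

```python
from copy import deepcopy

# Hypothetical sandbox state: database name -> list of row dicts.
# This layout is an assumption for illustration, not WorkBench's schema.

def send_email(state, recipient, subject):
    """Toy tool (hypothetical): appends one row to the 'email' database."""
    state["email"].append({"recipient": recipient, "subject": subject})

def run_actions(initial_state, actions):
    """Apply (tool, kwargs) calls to a copy of the initial state and return it."""
    state = deepcopy(initial_state)
    for tool, kwargs in actions:
        tool(state, **kwargs)
    return state

def outcome_matches(initial_state, predicted_actions, ground_truth_actions):
    """True when both action lists leave the databases in the same final state."""
    predicted = run_actions(initial_state, predicted_actions)
    expected = run_actions(initial_state, ground_truth_actions)
    return predicted == expected

def has_side_effect(initial_state, predicted_actions, ground_truth_actions):
    """True when the agent changed the databases but not in the expected way."""
    predicted = run_actions(initial_state, predicted_actions)
    expected = run_actions(initial_state, ground_truth_actions)
    return predicted != initial_state and predicted != expected

initial = {"email": [], "calendar": []}
truth = [(send_email, {"recipient": "sam@example.com", "subject": "Q3 report"})]
correct = [(send_email, {"recipient": "sam@example.com", "subject": "Q3 report"})]
wrong = [(send_email, {"recipient": "olly@example.com", "subject": "Q3 report"})]

print(outcome_matches(initial, correct, truth))  # True
print(outcome_matches(initial, wrong, truth))    # False
print(has_side_effect(initial, wrong, truth))    # True: email sent to the wrong person
```

Comparing final states means that any sequence of tool calls reaching the correct outcome counts as a success, while an incorrect change, such as an email sent to the wrong recipient, is detectable as a side effect. This sketch compares rows in insertion order; a real evaluator would likely need an order-insensitive comparison.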
