WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, Bertie Vidgen

TL;DR

WorkBench addresses a gap in evaluating autonomous workplace agents by introducing an outcome-centric benchmark with 690 tasks across five domains in a sandbox environment. Agents operate with 26 tools to perform multi-step tasks, and each task has a unique ground-truth outcome enabling robust automatic evaluation. The study demonstrates that state-of-the-art models (GPT-4) achieve only about 43% accuracy (49% with resampling), highlighting current weaknesses in planning, tool selection, and handling tool constraints in high-stakes settings. The dataset is open-source and designed to scale with additional domains and tasks, paving the way for more realistic and rigorous evaluation of workplace AI agents.

Abstract

We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings. WorkBench is publicly available as a free resource at https://github.com/olly-styles/WorkBench.

Paper Structure (30 sections, 7 figures, 7 tables)

This paper contains 30 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Agents in the workplace. A sample task from WorkBench, the first dataset for evaluating autonomous agents on realistic workplace tasks.
  • Figure 2: Our complete pipeline for evaluating agents. 1) Sandbox environment: the sandbox has an initial state, defined by five databases. 2) Task: a request is sent by the user. 3) Task execution: the task is sent to the agent, which has access to toolkits in various domains. The agent takes actions using these tools, which may alter the sandbox databases. The agent observes the result of using each tool to determine whether more actions are required. 4) Outcome-centric evaluation: the updated sandbox databases are compared against the ground truth.
  • Figure 3: Task and outcome creation. The left side shows a pair of task-and-outcome templates. The outcome template is a function that returns the ground truth for the changes to the sandbox databases, given correct task completion. The right side shows a task-outcome pair created from these templates. In this example, the correct outcome is that the next meeting with Carlos is no longer in the Calendar sandbox.
  • Figure 4: Ground-truth number of actions required to complete a task. 18% of tasks require no actions: the agent may use retrieval tools but does not need to execute any action, for example when asked to cancel meetings on a date when none are scheduled. These tasks are easier, as shown in the paper's appendix on no-action tasks.
  • Figure 5: Outcome-Centric Evaluation. We propose outcome-centric evaluation, where there is a unique ground-truth outcome for each task (lower panel). We consider the task correctly executed if the predicted outcome following the agent's actions matches this outcome. This allows the agent to find multiple paths to the correct outcome, unlike prior works (upper panel) which evaluate the agent's function calls.
  • ...and 2 more figures
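
The captions for Figures 3 and 5 describe an outcome template that returns the ground-truth database state, and an evaluation that compares the sandbox state after the agent acts against that ground truth. The following is a minimal sketch of that idea, not the WorkBench implementation. The record layout, the `delete_event` tool, and the function names `outcome_cancel_next_meeting` and `outcomes_match` are all hypothetical, and it assumes every stored event lies in the future, so the earliest date is the "next" meeting.

```python
from copy import deepcopy

# Hypothetical sandbox state: one list of records per database (only a calendar here).
initial_state = {
    "calendar": [
        {"event_id": "1", "participant": "carlos", "date": "2023-12-01"},
        {"event_id": "2", "participant": "carlos", "date": "2023-12-05"},
        {"event_id": "3", "participant": "amy", "date": "2023-12-02"},
    ],
}


def outcome_cancel_next_meeting(state, participant):
    """Hypothetical outcome template: the expected state once the participant's
    next (earliest) meeting has been cancelled."""
    expected = deepcopy(state)
    meetings = [e for e in expected["calendar"] if e["participant"] == participant]
    if meetings:
        # ISO dates sort chronologically as strings.
        next_meeting = min(meetings, key=lambda e: e["date"])
        expected["calendar"].remove(next_meeting)
    return expected


def delete_event(state, event_id):
    """Hypothetical tool: delete one calendar event from the sandbox in place."""
    state["calendar"] = [e for e in state["calendar"] if e["event_id"] != event_id]


def outcomes_match(predicted, expected):
    """Compare two sandbox states database by database, ignoring row order."""
    def normalise(rows):
        return sorted(rows, key=lambda row: sorted(row.items()))

    return predicted.keys() == expected.keys() and all(
        normalise(predicted[name]) == normalise(expected[name]) for name in expected
    )


expected = outcome_cancel_next_meeting(initial_state, "carlos")

# An agent that deletes the correct event reaches the ground-truth outcome.
correct_run = deepcopy(initial_state)
delete_event(correct_run, "1")

# An agent that deletes the wrong meeting changes the sandbox, but incorrectly.
wrong_run = deepcopy(initial_state)
delete_event(wrong_run, "2")

print(outcomes_match(correct_run, expected))  # True
print(outcomes_match(wrong_run, expected))    # False
```

Because only the final state is compared, any sequence of tool calls that ends in the expected state counts as correct, which is the property the Figure 5 caption contrasts with evaluating the function calls themselves.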