WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, Alexandre Drouin

TL;DR

WorkArena++ introduces a large-scale, compositional benchmark for web agents operating in enterprise software, built on the ServiceNow platform. It pairs two harder levels (L2 and L3) with five skill categories to probe planning, retrieval, reasoning, memorization, and infeasibility handling, supported by a standardized curriculum and ground-truth interaction traces. Empirical results show current state-of-the-art LLM/VLM agents struggle substantially on WorkArena++, while humans solve tasks with high success, highlighting gaps in planning, memory, and cross-modal understanding. The benchmark, along with its trace-extraction framework and visual-diversity design, offers a scalable path to advancing autonomous knowledge-work agents and generating fine-tuning data for future models.

Abstract

The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in applying these capabilities for autonomous task solving remains underexplored. This is especially true in enterprise settings, where automated agents hold the promise of a high impact. To fill this gap, we propose WorkArena++, a novel benchmark consisting of 682 tasks corresponding to realistic workflows routinely performed by knowledge workers. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents. Our empirical studies across state-of-the-art LLMs and vision-language models (VLMs), as well as human workers, reveal several challenges for such models to serve as useful assistants in the workplace. In addition to the benchmark, we provide a mechanism to effortlessly generate thousands of ground-truth observation/action traces, which can be used for fine-tuning existing models. Overall, we expect this work to serve as a useful resource to help the community progress toward capable autonomous agents. The benchmark can be found at https://github.com/ServiceNow/WorkArena.

Paper Structure (62 sections, 53 figures, 3 tables, 1 algorithm)

Figures (53)

  • Figure 1: Example WorkArena++ task: Restock low inventory items. Here, the agent acts as an IT worker tasked with restocking items whose stock is below some threshold: ① as is common, it receives instructions via a ticket assigned to it in the system; ② it must then read the dashboard to extract all items whose stock count is low; ③ it reorders those items from the service catalog to match a minimum stock quantity; and ④ it closes the ticket assigned to it once the task is completed.
  • Figure 2: Background: (a) In WorkArena, tasks measure the ability of web agents to interact with basic UI components in the ServiceNow platform, illustrated above. (b) In BrowserGym, the agent receives a natural-language goal from a human user via chat. It then perceives the environment (web browser) through a set of multimodal observations (e.g., HTML and screenshot) and controls it via a standardized set of available actions. Reproduced from Drouin et al. (2024) with permission.
  • Figure 3: In WorkArena(++), the agent interacts with the frontend of a remote-hosted ServiceNow instance via BrowserGym. Task validation then inspects both the state of the database and any open page using backend (REST) and frontend (JS) APIs (https://developer.servicenow.com/dev.do#!/reference).
  • Figure 4: Overview of WorkArena++: (a) Distribution of tasks across the skills introduced in \ref{sec:wapp-skills} for all 682 tasks in the L2/L3 sets. (b) Task length, estimated as the number of actions the Oracle requires for completion (see \ref{sec:wa}), for all 470 L2/L3 task instances in the agent curriculum (\ref{sec:evaluation_curriculum}). Tasks from the L1 set are also included for comparison (33 tasks × 5 seeds).
  • Figure 5: Consent form signed by all human evaluators prior to participating in the study.
  • ...and 48 more figures
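
The reasoning behind steps ② and ③ of the Figure 1 task can be illustrated with a small sketch. This is not code from the benchmark: the `StockLevel` type, the `plan_restock` helper, the item names, and the stock counts are all hypothetical, and the sketch assumes a single value serves as both the low-stock threshold and the minimum stock quantity to restore.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StockLevel:
    """One dashboard entry (hypothetical structure, for illustration only)."""
    item: str      # catalog item name
    in_stock: int  # count as read from the dashboard


def plan_restock(levels, minimum):
    """Return {item: quantity_to_order} for every item stocked below `minimum`.

    Assumption: the low-stock threshold and the target quantity are the same
    value, so each order brings the item back up to exactly `minimum`.
    """
    return {
        lvl.item: minimum - lvl.in_stock
        for lvl in levels
        if lvl.in_stock < minimum
    }


# Made-up dashboard readings standing in for step 2 of the task.
dashboard = [
    StockLevel("laptop", 3),
    StockLevel("monitor", 12),
    StockLevel("keyboard", 0),
]

# Step 3: the quantities the agent would need to order from the catalog.
orders = plan_restock(dashboard, minimum=10)
print(orders)
```

The arithmetic itself is trivial; the difficulty the benchmark targets is that an agent has to retrieve the inputs from a dashboard, carry them across pages, and act on them through the web UI.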