WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, Toshihiko Yamasaki

TL;DR

WebChoreArena extends the WebArena benchmark with 532 carefully curated tasks that push web-browsing agents beyond general browsing into tedious chores built around massive memory, calculation, and long-term memory. Built on the same simulation environments as WebArena, it enables reproducible, fair comparisons and clearer differentiation among large language model–based agents. Experimental results show that GPT-4o struggles substantially, while Claude 3.7 Sonnet and Gemini 2.5 Pro improve meaningfully but still leave considerable room for improvement, underscoring the remaining challenges in realistic tedious tasks. The benchmark highlights the importance of memory management, task design, and model–agent interactions, and points to future work on extending to online environments and more diverse UI domains.

Abstract

Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information in the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks requiring information to be retained across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, significant improvements in performance are observed on WebChoreArena. These findings suggest that WebChoreArena is well-suited to measure the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.

Paper Structure

This paper contains 36 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: The WebChoreArena challenge. WebChoreArena extends WebArena by introducing more complex and labor-intensive tasks, pushing the boundaries of agent capabilities. This enhanced benchmark allows for a clearer evaluation of progress in advanced models and reveals that even powerful models such as Gemini 2.5 Pro still have significant room for improvement.
  • Figure 2: Distribution of websites in WebChoreArena
  • Figure 3: Distribution of task types in WebChoreArena
  • Figure 5: Examples in each task type in WebChoreArena. (i) Massive Memory tasks require accurately memorizing a large amount of information from the given page. (ii) Calculation tasks involve performing arithmetic operations. (iii) Long-Term Memory tasks require the agent to retain relevant information across many steps and interactions. (iv) Others involve tasks that require special or domain-specific operations.
  • Figure 6: Comparison across different task types. This result reveals that the methodology of the agent itself has a substantial impact on its effectiveness across different task types.
  • ...and 6 more figures