ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen

Abstract

AI agents may be able to automate your inbox, but can they automate the other routine parts of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks demand capabilities beyond those tested by existing benchmarks: extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations such as correctly filling in many detailed forms. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluation of 7 frontier models shows that both proprietary and open-source models complete only a small fraction of these tasks; even the strongest, Claude Sonnet 4.6, achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
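To make the interception layer concrete, here is a minimal sketch of one way such a layer could work, using Playwright. The paper does not name its tooling, and `is_final_submission` and its URL keywords are hypothetical stand-ins for whatever matching rules ClawBench actually uses; this is an illustration of the idea, not the benchmark's implementation.

```python
# Sketch: let all traffic through, but capture and block the one request
# that would cause a real-world side effect (a purchase, booking, etc.).
from playwright.sync_api import Route, sync_playwright


def is_final_submission(route: Route) -> bool:
    # Hypothetical heuristic: treat POSTs to submission-like endpoints
    # as the irreversible final action.
    req = route.request
    return req.method == "POST" and any(
        kw in req.url for kw in ("checkout", "submit", "apply")
    )


def handle(route: Route) -> None:
    if is_final_submission(route):
        # Record the payload for later verification, then abort the
        # request so the submission never actually goes through.
        print("captured final submission:", route.request.post_data)
        route.abort()
    else:
        route.continue_()  # everything else proceeds normally


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", handle)  # intercept every network request
    page.goto("https://example.com")
    browser.close()
```

Blocking only the final request, rather than sandboxing the whole site, is what lets the agent experience the full live-site workflow up to the last step.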

Paper Structure

This paper contains 27 sections, 2 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: ClawBench overview. Left: 153 tasks across 15 life categories. Middle: existing benchmarks evaluate agents in offline sandboxes with static HTML and fixed DOM structures; ClawBench evaluates on live websites with real-world complexity and provides rich, traceable verdicts via an agentic evaluator. Right: Claude-Sonnet-4.6 and GPT-5.4 achieve 65-75% task completion on established benchmarks such as OSWorld and WebArena, but only 33.3% and 6.5%, respectively, on ClawBench, highlighting the difficulty of real-world everyday web tasks.
  • Figure 2: Main results: success rate on ClawBench for 7 frontier models. Even the strongest model (Claude Sonnet 4.6) completes only 33.3% of tasks, while two of the seven models score below 5%. See the main-results table for the per-category breakdown.
  • Figure 3: The ClawBench evaluation pipeline. Setup: a human-authored task with explicit verification conditions. Execution: the agent operates in a real browser while five layers of behavioral data are recorded. Evaluation: the recorded trajectory is scored against a human ground-truth trajectory by an Agentic Evaluator, producing a binary pass/fail verdict with step-level justification (a sketch of such a verdict follows this list).
  • Figure 4: Task taxonomy of ClawBench. Inner ring: 8 high-level category groups; outer ring: 15 fine-grained categories. The dataset spans 153 tasks across diverse real-world domains.
  • Figure 5: Benchmark saturation comparison. Claude-Sonnet-4.6 performs substantially better on existing web-agent benchmarks than on ClawBench, indicating that ClawBench remains challenging for frontier agents.
  • ...and 2 more figures
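The paper does not publish the evaluator's output format, so the following is only a hypothetical sketch of what a "binary pass/fail verdict with step-level justification" (Figure 3) might look like as a data structure; every field name here is an assumption.

```python
# Hypothetical verdict schema for the Agentic Evaluator's output.
from dataclasses import dataclass, field


@dataclass
class StepJudgment:
    step: int         # index into the recorded trajectory
    condition: str    # the verification condition being checked
    satisfied: bool   # whether this step meets the condition
    rationale: str    # the evaluator's justification for its judgment


@dataclass
class Verdict:
    passed: bool  # overall binary pass/fail for the task
    steps: list[StepJudgment] = field(default_factory=list)

    def failure_reasons(self) -> list[str]:
        # Collect justifications for every unmet condition.
        return [s.rationale for s in self.steps if not s.satisfied]
```

Under this sketch, the evaluator would emit one StepJudgment per verification condition and set `passed = all(s.satisfied for s in steps)`, so a failed task comes with traceable, step-level reasons rather than a bare score.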