WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions

Sanjari Srivastava, Gang Li, Cheng Chang, Rishu Garg, Manpreet Kaur, Charlene Y. Lee, Yuezhang Li, Yining Mao, Ignacio Cases, Yanan Xie, Peng Qi

TL;DR

WARC-Bench introduces GUI subtasks and a Web Archive-based benchmark to evaluate short-horizon interactions on realistic web pages. It provides sandboxed environments, deterministic evaluators, and a mix of real and synthetic websites to stress-test subtask completion. The study shows that leading frontier models struggle on subtasks, while the Subtask Vision Agent (SVA) and training with supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR) yield strong results among open-source models, narrowing the gap with closed models. The benchmark highlights subtask mastery as a prerequisite for robust web navigation. It also offers scalable, extensible environments for advancing GUI agent research, with practical implications for real-world web automation and planning.

Abstract

Training web agents to navigate complex, real-world websites requires them to master $\textit{subtasks}$: short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open-source models on subtasks, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.

Paper Structure

This paper contains 23 sections, 1 equation, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of WARC-Bench. We record real and synthetic websites as Web Archive files to create interactive web environments for evaluating subtask execution and widget interactions in GUI Agents. WARC-Bench uses programmatic reward functions for automatic evaluation. Subtasks remain challenging for frontier models. Our model trained via SFT and RLVR achieves state-of-the-art results among open-source models.
  • Figure 2: Statistics and Distribution of Subtasks in WARC-Bench. Each task in WARC-Bench can belong to multiple subtask categories; the figure on the right illustrates the category coverage.
  • Figure 3: Diagram of the Subtask Vision Agent (SVA) design.
  • Figure 4: Behavioral analysis of the Ours-72B-SFT (SFT) vs. Ours-72B-RLVR (RLVR) models.
  • Figure 5: Accuracy-latency tradeoffs: baseline computer-use agents vs. our models.
  • ...and 1 more figures