WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

Sicheng Fan, Rui Wan, Yifei Leng, Gaoning Liang, Li Ling, Yanyi Shang, Dehan Kong

TL;DR

This work introduces WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, and proposes a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on the proposed WebChainBench and other public GUI benchmarks.

Abstract

We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.

Paper Structure (32 sections, 3 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Dataset Overview. A statistical summary of WebChain, including interaction distribution across website categories, top-domain interaction frequencies, device pixel ratio distribution, trajectory complexity, and trajectory duration. These statistics collectively highlight the scale and diversity of WebChain.
  • Figure 2: Example trajectory and multi-dimensional step information in WebChain. Left: a long-horizon task on Booking.com with key steps along the trajectory. Right: the multi-dimensional step schema, including visual observations, structural semantics, and behavioral annotations.
  • Figure 3: Scaling effects of WebChain subsets (4k, 20k, and Full) on Qwen2.5-VL-3B's success rate after LCRL post-training.
  • Figure 4: Study on WCB-S evaluating the effects of Visual Grounding Densification (VGD) and Reasoner Prompting (RP) on spatial grounding performance.
  • Figure 5: Study on WCB-L evaluating the effect of SGRL Mid-Training on LCRL Post-Training.
  • ...and 1 more figures