WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

Sicheng Fan; Rui Wan; Yifei Leng; Gaoning Liang; Li Ling; Yanyi Shang; Dehan Kong

TL;DR

This work introduces WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, and proposes a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on the proposed WebChainBench and other public GUI benchmarks.

Abstract

We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.

Paper Structure (32 sections, 3 equations, 6 figures, 5 tables)

Introduction
Related Work
Web Interaction Datasets and Benchmarks
Vision-Language Models for GUI Grounding
Training Paradigms: From SFT to RL
The WebChain Dataset
Overview
Data Construction Pipeline
Stage 1: Constraint-Based Task Synthesis
Structured Functionality Extraction.
Schema-Constrained Task Generation.
Stage 2: Human-in-the-Loop Trajectory Collection
Stage 3: Post-processing Contextual Enrichment
Visual Grounding Densification.
Synthetic Rationale Generation (CoT).
...and 17 more sections

Figures (6)

Figure 1: Dataset Overview. A statistical summary of WebChain, including interaction distribution across website categories, top-domain interaction frequencies, device pixel ratio distribution, trajectory complexity, and trajectory duration. These statistics collectively highlight the scale and diversity of WebChain.
Figure 2: Example trajectory and multi-dimensional step information in WebChain. Left: a long-horizon task on Booking.com with key steps along the trajectory. Right: the multi-dimensional step schema, including visual observations, structural semantics, and behavioral annotations.
Figure 3: Scaling effects of WebChain subsets (4k, 20k, and Full) on Qwen2.5-VL-3B's success rate after LCRL post-training.
Figure 4: Study on WCB-S evaluating the effects of Visual Grounding Densification (VGD) and Reasoner Prompting (RP) on spatial grounding performance.
Figure 5: Study on WCB-L evaluating the effect of SGRL Mid-Training on LCRL Post-Training.
...and 1 more figure
