WebWorld: A Large-Scale World Model for Web Agent Training

Zikai Xiao, Jianhong Tu, Chuhang Zou, Yuxin Zuo, Zhi Li, Peng Wang, Bowen Yu, Fei Huang, Junyang Lin, Zuozhu Liu

TL;DR

WebWorld is a large-scale open-web world model trained on over 1M real-world trajectories to support long-horizon, multi-format web simulation. The paper introduces a scalable hierarchical data pipeline and WebWorld-Bench, an intrinsic evaluation benchmark, and reports strong extrinsic gains for agents trained on WebWorld-synthesized data, along with effective inference-time search. It also reports that performance improves with model size, that a limited amount of CoT data is enough to activate reasoning, and that the model generalizes across domains to code, GUI, and games. Together, these results offer a replicable recipe for building web-grounded world models with practical value for offline agent training and robust generalization.

Abstract

Web agents require massive trajectories to generalize, yet real-world training is constrained by network latency, rate limits, and safety risks. We introduce the WebWorld series, the first open-web simulator trained at scale. While existing simulators are restricted to closed environments with thousands of trajectories, WebWorld leverages a scalable data pipeline to train on 1M+ open-web interactions, supporting reasoning, multi-format data, and long-horizon simulations of 30+ steps. For intrinsic evaluation, we introduce WebWorld-Bench with dual metrics spanning nine dimensions, where WebWorld achieves simulation performance comparable to Gemini-3-Pro. For extrinsic evaluation, Qwen3-14B trained on WebWorld-synthesized trajectories improves by +9.2% on WebArena, reaching performance comparable to GPT-4o. WebWorld enables effective inference-time search, outperforming GPT-5 as a world model. Beyond web simulation, WebWorld exhibits cross-domain generalization to code, GUI, and game environments, providing a replicable recipe for world model construction.
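
To make the idea of inference-time search with a world model concrete, here is a minimal sketch of one-step lookahead: each candidate action is simulated by the world model, and the action whose predicted next state scores highest is chosen. This is an illustration under stated assumptions, not the paper's procedure. The names `select_action`, `toy_world_model`, and `toy_score` are hypothetical, states and actions are plain strings, and the toy functions stand in for a learned simulator and a learned or heuristic evaluator.

```python
from typing import Callable, Sequence

# Assumption: states and actions are represented as plain strings.
State = str
Action = str


def select_action(
    state: State,
    candidates: Sequence[Action],
    world_model: Callable[[State, Action], State],
    score: Callable[[State], float],
) -> Action:
    """Return the candidate whose simulated next state scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    # Simulate each action with the world model instead of the live site,
    # then rank the predicted next states with the scoring function.
    return max(candidates, key=lambda a: score(world_model(state, a)))


def toy_world_model(state: State, action: Action) -> State:
    # Hypothetical stand-in for a learned simulator: append the action.
    return f"{state} -> {action}"


def toy_score(state: State) -> float:
    # Hypothetical stand-in for a task-progress estimate.
    return 1.0 if "click(checkout)" in state else 0.0


best = select_action(
    "cart page",
    ["click(home)", "click(checkout)", "scroll(down)"],
    toy_world_model,
    toy_score,
)
print(best)
```

Because every candidate is evaluated in simulation, no request reaches the real website during the search, which is the property that sidesteps the latency, rate-limit, and safety constraints mentioned above.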

Paper Structure (40 sections, 4 equations, 18 figures, 15 tables)

Figures (18)

  • Figure 1: Overview of WebWorld. WebWorld is a large-scale world model for the open web, trained on over 1M real-world trajectories. It supports long-horizon, multi-format simulation, enabling agents trained with its data to achieve significant performance gains.
  • Figure 2: WebWorld Example. Left: Agent. Right: WebWorld.
  • Figure 3: Statistics of the WebWorld Dataset. (a) Diverse coverage across domains like Lifestyle, Tech, and Education. (b) Token distribution showing the model's exposure to varying context lengths. (c) Distribution of interaction turns, confirming the inclusion of long-horizon tasks (30+ steps).
  • Figure 4: Scaling Law of WebWorld. Larger models achieve lower eval loss. Stars indicate predictions for the 72B model, suggesting continued performance gains with model scaling.
  • Figure 5: Domain and Source Distribution of WebWorld Training Data. The chart illustrates the composition of our trajectory dataset, which contains over one million samples, across 15 distinct data sources. The colors represent different semantic domains (e.g., Technology, E-Commerce), showing that our data collection pipelines significantly contribute to the diversity of open-domain topics compared to traditional web generation methods.
  • ...and 13 more figures