Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

TL;DR

AWM addresses the critical need for diverse, scalable training grounds for agentic reinforcement learning by automatically synthesizing 1,000 executable, database-backed environments with a unified MCP interface. The pipeline decomposes generation into Scenario Generation, Task Generation, and Environment Synthesis, with robust, code-augmented verification to produce reliable reward signals. Empirical results across three benchmarks show strong out-of-distribution generalization when training exclusively in synthetic environments, with clear advantages over LLM-based simulation and competing synthesis methods. The work delivers a practical, open-source resource that supports large-scale RL for tool-use agents while highlighting areas for future self-evolution and optimization.

Abstract

Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
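
To make the abstract's point about code-driven, database-backed environments concrete, here is a minimal sketch of how a tool call can produce a deterministic state transition and how a reward can be read from the resulting database state. It is not taken from the AWM codebase: the `orders` schema and the names `make_env`, `cancel_order`, and `reward` are hypothetical, and SQLite stands in for whatever database the released environments use.

```python
import sqlite3


def make_env() -> sqlite3.Connection:
    """Create a toy in-memory environment state (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, status TEXT)"
    )
    conn.execute("INSERT INTO orders (item, status) VALUES ('book', 'pending')")
    conn.commit()
    return conn


def cancel_order(conn: sqlite3.Connection, order_id: int) -> dict:
    """Hypothetical tool: the state transition is plain code, so it is deterministic."""
    cur = conn.execute(
        "UPDATE orders SET status = 'cancelled' WHERE id = ? AND status = 'pending'",
        (order_id,),
    )
    conn.commit()
    # rowcount is 1 only if a pending order with this id was actually updated.
    return {"ok": cur.rowcount == 1, "order_id": order_id}


def reward(conn: sqlite3.Connection, order_id: int) -> float:
    """Hypothetical verifier: score the final database state, not the agent's text."""
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return 1.0 if row is not None and row[0] == "cancelled" else 0.0
```

Because the verifier inspects the stored state directly, an agent that merely claims to have cancelled the order without calling the tool would receive no reward in this sketch.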

Paper Structure (29 sections, 3 equations, 30 figures, 11 tables)

This paper contains 29 sections, 3 equations, 30 figures, 11 tables.

Figures (30)

  • Figure 1: Agent World Model (AWM) is a synthetic environment generation pipeline that synthesizes 1,000 diverse code-driven agentic environments with databases for training tool-use agents.
  • Figure 2: Overview of AWM. Starting from scenario synthesis, we progressively generate tasks, database, interface and verification to obtain fully executable environments. Then, we perform multi-turn RL training for tool-use agents in our synthesized environments.
  • Figure 3: Diversity analysis of 1,000 synthesized environments. (a) Embedding diversity is calculated by encoding the scenario description, database schema and toolset schema. (b) Category coverage counts the number of unique topics of scenarios.
  • Figure 4: Scaling of AWM over environment sizes with 4B model.
  • Figure 5: Format error ratio comparison of AWM. "w/o Format" means disabling the step-level format correctness reward.
  • ...and 25 more figures