Scaling Agent Learning via Experience Synthesis

Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh

TL;DR

The paper addresses the high cost, limited diversity, and unreliable rewards of real-environment RL for training general-purpose agents. It introduces DreamGym, a unified framework that synthesizes diverse, reasoning-grounded experiences via a meta-reasoning experience model, an experience replay buffer, and a curriculum-driven task generator to enable scalable online RL and efficient sim-to-real transfer. Empirical results show DreamGym delivers substantial gains on non-RL-ready tasks, matches or surpasses RL baselines in RL-ready tasks with purely synthetic data, and provides strong warm-start benefits for real-world learning with limited data. A theoretical analysis bounds policy improvement in real environments when trained with synthetic experiences, highlighting that reward accuracy and domain-consistent transitions—not exact state fidelity—drive transfer success. Overall, DreamGym offers a scalable, data-efficient path to training versatile agents with reduced real-world interaction costs and improved generalization across domains.

Abstract

While reinforcement learning (RL) can empower autonomous agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.
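
The replay-buffer idea in the abstract (seed with offline real-world transitions, keep adding fresh interactions, and let the experience model consult stored experience) can be pictured with a minimal sketch. Everything below is illustrative rather than the authors' implementation: the `Transition` and `ReplayBuffer` names, the example transitions, and the token-overlap (Jaccard) similarity are assumptions standing in for whatever retriever the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    state: str
    action: str
    next_state: str
    reward: float


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for a learned retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class ReplayBuffer:
    items: list = field(default_factory=list)

    def add(self, t: Transition) -> None:
        """Store one transition, whether offline seed data or a fresh interaction."""
        self.items.append(t)

    def top_k(self, state: str, action: str, k: int = 2) -> list:
        """Return the k stored transitions most similar to the query (state, action)."""
        query = f"{state} {action}"
        ranked = sorted(
            self.items,
            key=lambda t: jaccard(query, f"{t.state} {t.action}"),
            reverse=True,
        )
        return ranked[:k]


buffer = ReplayBuffer()
# Hypothetical offline seed transitions.
buffer.add(Transition("search page", "type red shoes", "results page for red shoes", 0.0))
buffer.add(Transition("results page", "click first item", "product page", 0.0))
# Hypothetical fresh interaction added during training.
buffer.add(Transition("product page", "click buy now", "order confirmed", 1.0))

# Retrieve context for a new (state, action) pair before predicting its outcome.
neighbors = buffer.top_k("product page", "click add to cart", k=2)
```

In this sketch the retrieved neighbors would be passed to the experience model as context when it reasons about the next state and reward; how that prompt is built is not specified here.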

Paper Structure

This paper contains 48 sections, 2 theorems, 20 equations, 6 figures, 2 tables.

Key Result

Theorem 1

Let the real MDP be $\mathcal{M}=(\mathcal{S},\mathcal{A},P,R,\gamma)$, the synthetic MDP induced by $\mathcal{M}_{\mathrm{exp}}$ be $\widehat{\mathcal{M}}=(\mathcal{S},\mathcal{A},\widehat{P},\widehat{R},\gamma)$, discount be $\gamma\in(0,1)$, and let rewards be bounded $R,\widehat{R}\in[0,R_{\max}]$ and a trust-region update $\pi\!\to\!\pi'$ obtained by optimizing in $\widehat{\mathcal{M}}$ with p

Figures (6)

  • Figure 1: Compared to the traditional agent learning paradigm, DreamGym provides the first scalable and effective RL framework with unified infrastructure.
  • Figure 2: Overview of the proposed DreamGym agent training framework. Given a set of seed tasks, a reasoning-based experience model interacts with the agent to generate informative, diverse tasks and trajectories for RL training. At each step, the agent takes actions based on its current state and receives next states and reward signals derived by the experience model through CoT reasoning based on both interaction history and top-$k$ similar experiences from an active replay buffer. To expose the agent to increasingly informative scenarios, tasks with high reward entropy are proposed by the curriculum task generator for future training. With this unified design, DreamGym addresses both task and reward sparsity while enabling scalable RL with diverse and curriculum-driven environments.
  • Figure 3: (1) Left: Comparing the agent performance (success rate %) on WebArena (Zhou et al.) w.r.t. total training time across different training strategies and backbones. (2) Middle: Evaluating the cross-domain transferability of the agent policy trained via DreamGym with seed tasks from a different environment. (3) Right: Comparing the agent performance on WebShop (Yao et al., 2022) w.r.t. number of training steps across different training strategies.
  • Figure 4: Evaluation of the experience model across key criteria using GPT-4o as the judge. We randomly sample 100 trajectories and prompt the model to assign discrete scores in $\{0,1,2\}$ across four criteria, as detailed in the paper's appendix (judge prompt).
  • Figure 5: Evaluation of the experience model across different offline training data sizes (number of transition steps) and backbones.
  • ...and 1 more figure
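
The Figure 2 caption says the curriculum task generator proposes tasks with high reward entropy. A minimal sketch of that selection step follows, assuming reward entropy means the Shannon entropy of the empirical reward distribution over a task's rollouts; the paper may define it differently. The function names and the example task data are hypothetical.

```python
import math
from collections import Counter


def reward_entropy(rewards):
    """Shannon entropy (bits) of the empirical distribution of rollout rewards."""
    counts = Counter(rewards)
    n = len(rewards)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def pick_curriculum_tasks(rollout_rewards, top_n=1):
    """Rank tasks by reward entropy (highest first) and keep the top_n."""
    ranked = sorted(
        rollout_rewards,
        key=lambda task: reward_entropy(rollout_rewards[task]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical per-task rewards from four rollouts of the current policy.
rollouts = {
    "task_easy": [1, 1, 1, 1],   # always solved: entropy 0
    "task_mixed": [1, 0, 1, 0],  # solved half the time: entropy 1 bit
    "task_hard": [0, 0, 0, 0],   # never solved: entropy 0
}
selected = pick_curriculum_tasks(rollouts, top_n=1)
```

Under this assumption, tasks the policy always solves or always fails score zero, so the selection favors tasks of intermediate difficulty, where rollouts disagree and the learning signal is richest.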

Theorems & Definitions (4)

  • Theorem 1: Policy Improvement $J$ in Real Environment via Synthetic Experiences
  • proof: Proof of Theorem 1
  • Lemma 1: Multi-turn experience synthesis error bound
  • proof