Agent Learning via Early Experience

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu

TL;DR

Early Experience presents a practical, reward-free paradigm to improve language agents by learning from the consequences of their own actions. By formulating two strategies—implicit world modeling and self-reflection—the authors convert action-driven future states into scalable supervision, enabling improvements in both effectiveness and out-of-domain generalization across diverse environments. The approach serves as a robust bridge that enhances imitation learning and provides a strong initialization for subsequent reinforcement learning, with demonstrated gains across multiple model families and tasks. The work suggests a scalable path toward more capable agents that can continually improve from their own experience, even in reward-sparse real-world settings.

Abstract

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
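The data-collection idea in the abstract (the agent proposes its own actions, and the resulting future states become supervision without any reward) can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the authors' implementation: `ToyEnv`, `propose_actions`, `collect_early_experience`, `to_world_model_examples`, and the prompt layout are all hypothetical names and choices introduced here, and a real setup would sample alternative actions from the initial LLM policy rather than from a fixed action list.

```python
import random
from dataclasses import dataclass


@dataclass
class Transition:
    state: str
    action: str
    next_state: str


class ToyEnv:
    """Hypothetical environment: the state is an integer counter rendered as text."""

    ACTIONS = ("inc", "dec", "noop")

    def step(self, state: str, action: str) -> str:
        value = int(state)
        if action == "inc":
            value += 1
        elif action == "dec":
            value -= 1
        return str(value)


def propose_actions(state, expert_action, k, rng):
    """Stand-in for sampling up to k alternative actions from the initial policy."""
    candidates = [a for a in ToyEnv.ACTIONS if a != expert_action]
    return rng.sample(candidates, min(k, len(candidates)))


def collect_early_experience(env, expert_data, k, seed=0):
    """Branch k alternative actions from each expert state and record the outcomes.

    No reward is read from the environment; only the next state is stored.
    """
    rng = random.Random(seed)
    rollouts = []
    for state, expert_action in expert_data:
        for action in propose_actions(state, expert_action, k, rng):
            rollouts.append(Transition(state, action, env.step(state, action)))
    return rollouts


def to_world_model_examples(rollouts):
    """Turn transitions into (prompt, target) pairs for next-state prediction."""
    return [
        {
            "prompt": f"State: {t.state}\nAction: {t.action}\nNext state:",
            "target": t.next_state,
        }
        for t in rollouts
    ]
```

Calling `collect_early_experience(ToyEnv(), [("0", "inc"), ("5", "dec")], k=2)` yields four transitions, two per expert state. Passing them to `to_world_model_examples` produces next-state prediction pairs of the kind implicit world modeling would train on in this simplified setting.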

Paper Structure

This paper contains 31 sections, 4 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Progression of training paradigms for language agents. Left: The Era of Human Data relies on expert demonstrations, where supervision comes from human-/expert-curated actions; it is reward-free (i.e., does not require the environment to provide verifiable reward) but not data-scalable. Right: The envisioned Era of Experience builds upon environments with verifiable rewards, using them as the primary supervision for reinforcement learning; however, many environments either lack such rewards [Xue2025online-m2w] or require inefficient long-horizon rollouts [Xie2024TravelPlanner]. Center: Our Early Experience paradigm enables agents to propose actions and collect the resulting future states, using them as a scalable and reward-free source of supervision.
  • Figure 2: Overview of the two early experience approaches. Implicit world modeling (left) augments expert trajectories with alternative actions and predicted next states, training the policy to internalize transition dynamics before deployment. Self-reflection (right) augments expert actions with self-generated explanations $c_1$, training the policy to reason about and revise its own decisions. Both methods use alternative actions proposed by the initial policy (LLM). The number of alternatives ($K$) is a hyperparameter; for brevity, only one is illustrated.
  • Figure 3: Reinforcement learning (GRPO) starting from checkpoints trained with different methods on three infra-ready environments. Bars show performance before (deeper shade) and after RL (lighter shade) for three methods. Checkpoints from early-experience methods (IWM, SR) consistently lead to higher post-RL ceilings than imitation-only starts, with advantages often maintained or amplified after RL.
  • Figure 4: Effect of demonstration budget and branching factor. (a): success rate vs. fraction of expert trajectories; (b): success rate vs. branching factor $K$ (number of alternative actions per state in $\mathcal{D}_\text{expert}$). Results are shown for WebShop and ALFWorld using Llama-3.1-8B-Instruct.
  • Figure 5: Performance of Llama with different model sizes trained with imitation learning and methods under early experience on the WebArena-Lite benchmark.
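
The self-reflection branch in the Figure 2 caption (expert actions augmented with self-generated explanations, with alternatives proposed by the initial policy) might be assembled into training examples roughly as below. This is a sketch under assumptions, not the paper's procedure: the function names, the prompt wording, and the choice to contrast the expert action with the observed outcomes of the alternatives are hypothetical. `generate` is a placeholder for whatever call returns text from the initial policy.

```python
from typing import Callable, Dict, List, Tuple


def build_reflection_prompt(
    state: str,
    expert_action: str,
    alternatives: List[Tuple[str, str]],
) -> str:
    """Ask the model to explain the expert action given alternative outcomes.

    `alternatives` holds (action, resulting_next_state) pairs.
    The wording of this prompt is an assumption made for illustration.
    """
    lines = [
        f"State: {state}",
        f"Expert action: {expert_action}",
        "Alternative actions and their outcomes:",
    ]
    for action, next_state in alternatives:
        lines.append(f"- {action} -> {next_state}")
    lines.append("Explain why the expert action is preferable.")
    return "\n".join(lines)


def make_reflection_example(
    state: str,
    expert_action: str,
    alternatives: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Build one training pair: state -> explanation followed by the expert action."""
    explanation = generate(build_reflection_prompt(state, expert_action, alternatives))
    return {
        "prompt": f"State: {state}",
        "target": f"{explanation}\nAction: {expert_action}",
    }
```

With a stub such as `generate = lambda prompt: "Incrementing moves toward the goal."`, `make_reflection_example("0", "inc", [("dec", "-1"), ("noop", "0")], generate)` returns a pair whose target is the explanation followed by `Action: inc`. Placing the explanation before the action means the policy is trained to produce its reasoning first and then commit to a decision.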