
Experiential Reinforcement Learning

Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, Jieyu Zhao

TL;DR

ERL adds an explicit experience–reflection–consolidation loop to reinforcement learning for language-model agents trained under sparse, delayed rewards. The model makes an initial attempt, reflects on its outcome, and makes a second attempt guided by that reflection; successful revisions are reinforced and internalized into the base policy via memory and distillation. ERL learns faster and reaches stronger final policies across control and reasoning tasks, with notable gains on Sokoban and FrozenLake and more modest gains on HotpotQA, and the gains persist at deployment, where the model runs without the reflection step. The results indicate that embedding explicit experiential revision within RL trajectories can turn feedback into durable behavioral improvement, offering a practical pathway toward experience-grounded, self-improving agents.

Abstract

Reinforcement learning has become the central approach for language models (LMs) to learn from environmental reward or feedback. In practice, the environmental feedback is usually sparse and delayed. Learning from such signals is challenging, as LMs must implicitly infer how observed failures should translate into behavioral changes for future iterations. We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy. This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks. These results suggest that integrating explicit self-reflection into policy training provides a practical mechanism for transforming feedback into durable behavioral improvement.

Paper Structure (25 sections, 10 equations, 7 figures, 9 tables, 2 algorithms)

Figures (7)

  • Figure 1: In Experiential Reinforcement Learning (ERL), instead of learning from feedback or outcome directly, an agent learns to (1) verbally reflect on its experience and observed outcome, and (2) internalize the reflections to induce behavioral changes in future iterations.
  • Figure 2: Conceptual comparison of learning dynamics in RLVR and Experiential Reinforcement Learning (ERL). RLVR relies on repeated trial-and-error driven by scalar rewards, leading to back-and-forth exploration without durable correction. ERL augments this process with an experience–reflection–consolidation loop that generates a revised attempt and internalizes successful corrections, enabling persistent behavioral improvement.
  • Figure 3: Overview of Experiential Reinforcement Learning (ERL). Given an input task $x$, the language model first produces an initial attempt and receives environment feedback. The same model then generates a self-reflection conditioned on this attempt, which is used to guide a second attempt. Both attempts and reflections are optimized with reinforcement learning, while successful second attempts are internalized via self-distillation, so the model learns to reproduce improved behavior directly from the original input without self-reflection.
  • Figure 4: Validation reward trajectories versus training wall-clock time on FrozenLake, HotpotQA, and Sokoban for Qwen3-4B-Instruct-2507 and Olmo-3-7B-Instruct. ERL consistently achieves higher reward and faster improvement than RLVR across tasks and models.
  • Figure 5: Final evaluation reward on FrozenLake, HotpotQA, and Sokoban. ERL consistently outperforms RLVR for both Qwen3-4B-Instruct-2507 and Olmo-3-7B-Instruct.
  • ...and 2 more figures
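
The loop described in the Figure 3 caption can be sketched as a data-collection step. This is a minimal illustration under stated assumptions, not the authors' implementation: `erl_collect`, `ERLBatch`, the prompt templates, `success_threshold`, and the toy policy and environment are all hypothetical names introduced here. The sketch also assumes that the reflection is credited with the second attempt's reward and that reflection runs on every task; the summary above does not specify either detail.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ERLBatch:
    # (prompt, model output, scalar reward) triples for the RL update.
    rl_samples: List[Tuple[str, str, float]] = field(default_factory=list)
    # (original task, successful second attempt) pairs for self-distillation.
    distill_samples: List[Tuple[str, str]] = field(default_factory=list)


def erl_collect(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Tuple[float, str]],
    success_threshold: float = 1.0,  # hypothetical cutoff for "successful"
) -> ERLBatch:
    """Run one experience -> reflection -> second-attempt pass on a task."""
    batch = ERLBatch()

    # 1. Initial attempt and environment feedback.
    first = generate(task)
    r1, feedback = evaluate(task, first)
    batch.rl_samples.append((task, first, r1))

    # 2. The same model reflects on its attempt and the feedback.
    reflect_prompt = (
        f"{task}\n\nPrevious attempt:\n{first}\n\n"
        f"Feedback:\n{feedback}\n\nReflect on what to change."
    )
    reflection = generate(reflect_prompt)

    # 3. Second attempt, conditioned on the reflection.
    retry_prompt = f"{task}\n\nReflection:\n{reflection}"
    second = generate(retry_prompt)
    r2, _ = evaluate(task, second)
    batch.rl_samples.append((retry_prompt, second, r2))

    # Assumption: the reflection is rewarded by the second attempt's outcome.
    batch.rl_samples.append((reflect_prompt, reflection, r2))

    # 4. Internalization: pair a successful second attempt with the ORIGINAL
    #    task, so the policy can learn to produce it without a reflection.
    if r2 >= success_threshold:
        batch.distill_samples.append((task, second))
    return batch


# Toy stand-ins for a policy and a sparse-reward environment (hypothetical).
def toy_generate(prompt: str) -> str:
    if "Reflect on" in prompt:
        return "Try moving right instead."
    if "Reflection:" in prompt:
        return "right"
    return "left"


def toy_evaluate(task: str, attempt: str) -> Tuple[float, str]:
    if attempt == "right":
        return 1.0, "Reached the goal."
    return 0.0, "Fell into a hole."
```

Calling `erl_collect("Reach the goal.", toy_generate, toy_evaluate)` yields three RL samples (first attempt, second attempt, reflection) and one distillation pair. The distillation pair deliberately uses the bare task as its prompt, which is what lets the deployed model skip the reflection step.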