Internalizing Agency from Reflective Experience

Rui Ge, Yichao Fu, Yuyang Qian, Junda Su, Yiming Zhao, Peng Zhao, Hao Zhang

Abstract

Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.
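
Since the abstract reports results in terms of Pass@1 and Pass@k, a short sketch of how Pass@k is commonly estimated may help. The snippet below uses the standard unbiased estimator computed from n sampled attempts of which c succeed. Whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is chosen here purely for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled attempts, c of which succeeded.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may compute Pass@k differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no success;
    # it is 0 whenever k > n - c, which makes the estimate exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with n = 4 samples and c = 1 success, the estimate is 0.25 at k = 1 and 0.5 at k = 2. A policy that only sharpens its distribution can raise the k = 1 value while leaving large-k values nearly unchanged, which is the gap the abstract describes.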

Paper Structure (42 sections, 8 equations, 5 figures, 7 tables, 1 algorithm)

Figures (5)

  • Figure 1: Internalizing feedback-grounded agency improves model capability (i.e., Pass@k) in long-horizon interaction, while outcome-only training (e.g., GRPO) yields limited gains beyond the base model.
  • Figure 2: Illustration of the LEAFE framework. Stage 1: During experience collection, the assistant periodically reviews the current trajectory and identifies a suboptimal round (denoted as red-colored $\tau$). It then produces the actionable experience $e$, which is concatenated with the restored history to facilitate subsequent attempts. Stage 2: During experience distillation, the model optimizes a joint loss using two datasets: randomly sampled rehearsal pairs to maintain capabilities, and counterfactual pairs (original prompts paired with experience-improved actions) to internalize diverse exploration. For simplicity, we depict one branching event from the rollback exploration tree.
  • Figure 3: Scaling results on different benchmarks. We plot the Pass@$k$ success rate as a function of the number of samples $k$. Our method (red) consistently achieves higher efficiency and performance ceilings across all tasks compared to the baselines.
  • Figure 4: Main results on WebShop and SciWorld. Bars show Pass@1 (solid) and Pass@k (hatched), in %. LEAFE consistently outperforms GRPO and other baselines across different model architectures and scales.
  • Figure 5: An example on Sokoban illustrating Stage 1 of LEAFE. Starting from a failed trajectory, the agent reflects on the interaction history, identifies an earlier suboptimal decision (step 3), and generates a compact experience summary for rollback-based revision. The environment is then reset to the selected step, the prior history is replayed (steps 1-2), and a new branch is explored under the guidance of the reflected experience. Repeating this failure → reflection → rollback → correction process enables the agent to recover from early mistakes and eventually reach a successful solution.
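
The failure → reflection → rollback → correction loop described in the captions for Figures 2 and 5 can be sketched on a toy problem. Everything in this snippet is a hypothetical stand-in rather than the authors' implementation: `ToyEnv`, `policy`, `reflect`, and `explore` are illustrative names, the environment is a simple "guess the action sequence" task whose feedback flags unhelpful steps, and the "experience" is reduced to a set of actions to avoid at a given step. The collected pairs mimic the counterfactual pairs of Stage 2 by storing the history without the experience alongside the improved action.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical task: succeed only if the actions match `target`."""
    target: tuple
    history: list = field(default_factory=list)

    def reset(self):
        self.history = []

    def step(self, action):
        # Return textual feedback for the action just taken.
        self.history.append(action)
        t = len(self.history) - 1
        if action == self.target[t]:
            return "ok"
        return f"step {t}: action {action} was unhelpful"

    def solved(self):
        return tuple(self.history) == self.target


def policy(history, experience, n_actions):
    """Stand-in policy: smallest action not ruled out for the current step."""
    banned = experience.get(len(history), set())
    return next(a for a in range(n_actions) if a not in banned)


def reflect(actions, feedback):
    """Stand-in reflection: locate the earliest step with negative feedback."""
    for t, fb in enumerate(feedback):
        if fb != "ok":
            return t, actions[t]
    return None


def explore(env, n_actions, max_attempts):
    experience = {}  # step index -> actions to avoid (toy "experience")
    pairs = []       # (history without experience, improved action)
    prefix, branch = [], None
    for _ in range(max_attempts):
        env.reset()
        actions, feedback = [], []
        for a in prefix:  # rollback: replay the restored history
            feedback.append(env.step(a))
            actions.append(a)
        while len(actions) < len(env.target):  # explore the new branch
            a = policy(actions, experience, n_actions)
            feedback.append(env.step(a))
            actions.append(a)
        # Keep the revised action only if it fixed the flagged step.
        if branch is not None and feedback[branch] == "ok":
            pairs.append((tuple(actions[:branch]), actions[branch]))
        if env.solved():
            return actions, pairs
        branch, bad = reflect(actions, feedback)
        experience.setdefault(branch, set()).add(bad)
        prefix = actions[:branch]
    return None, pairs
```

In a realistic setting the reflection and the revised action would both come from the language model, and the per-step feedback would be far less direct than in this toy environment. The sketch only shows the control flow: reset, replay the prefix, branch with experience, and keep the corrected decision for later distillation.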