Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management
Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen
TL;DR
The paper tackles the context-length bottleneck in reinforcement learning fine-tuning of LLMs for long-horizon, multi-turn tool use. It introduces a summarization-augmented MDP and a policy-gradient decomposition that enables end-to-end optimization of both tool-use behavior and summarization strategies, instantiated as SUPO (SUmmarization augmented Policy Optimization). Empirically, SUPO surpasses baselines on interactive function calling and searching tasks while maintaining the same or a shorter working context, and allowing more summarization rounds at test time than during training further boosts performance on complex searching tasks. This work demonstrates that principled context management via learned summaries is a scalable and effective approach to training RL agents beyond fixed context windows, with implications for broader long-horizon LLM applications.
Abstract
We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and, most importantly, strict context limits. To address these challenges, we introduce summarization-based context management into training. Specifically, it periodically compresses the tool-use history into LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that enables standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that, for complex searching tasks, SUPO can further improve evaluation performance when the maximum number of summarization rounds at test time is scaled beyond that used in training. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
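To make the idea of periodic context compression concrete, the following is a minimal sketch of a rollout loop that resets the working context to the task plus a summary whenever a token budget is exceeded. It is an illustration under stated assumptions, not the SUPO implementation: the names `rollout_with_summaries`, `act`, `tool`, `summarize`, `max_context_tokens`, and `max_summaries` are hypothetical, the whitespace token count stands in for a real tokenizer, and the toy callables stand in for an LLM policy and a tool environment.

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    # Crude whitespace count; an assumption standing in for a model tokenizer.
    return len(text.split())


def rollout_with_summaries(
    task: str,
    act: Callable[[str], str],                 # hypothetical policy: context -> action
    tool: Callable[[str], Tuple[str, bool]],   # hypothetical env: action -> (observation, done)
    summarize: Callable[[str], str],           # hypothetical summarizer: context -> summary
    max_context_tokens: int,
    max_summaries: int,
    max_steps: int = 100,
) -> List[str]:
    """Run one episode and return the working context of each segment."""
    segments: List[str] = []
    context = task
    summaries_used = 0
    for _ in range(max_steps):
        action = act(context)
        observation, done = tool(action)
        context = f"{context}\n{action}\n{observation}"
        if done:
            break
        if count_tokens(context) > max_context_tokens:
            if summaries_used >= max_summaries:
                break  # summary budget exhausted, so the episode ends here
            segments.append(context)
            # Replace the long history with the task plus a compact summary.
            context = f"{task}\n{summarize(context)}"
            summaries_used += 1
    segments.append(context)
    return segments


def make_toy_tool(finish_after: int) -> Callable[[str], Tuple[str, bool]]:
    # Toy environment: every call returns three tokens and ends after N calls.
    calls = {"n": 0}

    def tool(action: str) -> Tuple[str, bool]:
        calls["n"] += 1
        return "obs obs obs", calls["n"] >= finish_after

    return tool


segments = rollout_with_summaries(
    task="task",
    act=lambda ctx: "call",
    tool=make_toy_tool(finish_after=5),
    summarize=lambda ctx: "summary",
    max_context_tokens=8,
    max_summaries=3,
)
print(len(segments))
```

In this sketch, `max_summaries` is a separate argument so that a larger value can be passed at evaluation time than during training, mirroring the test-time scaling of summarization rounds described in the abstract. Each returned segment is one bounded working context; how such segments enter the policy gradient is specified by the paper and is not reproduced here.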
