Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen

TL;DR

The paper tackles the context-length bottleneck in reinforcement learning fine-tuning of LLMs for long-horizon, multi-turn tool use. It introduces a summarization-augmented MDP and a policy-gradient decomposition that enables end-to-end optimization of both tool-use behavior and summarization strategies, instantiated as SUPO (SUmmarization augmented Policy Optimization). Empirically, SUPO surpasses baselines in interactive function calling and searching tasks while maintaining the same or shorter working context, and test-time expansion of summarization rounds further boosts performance on complex tasks. This work demonstrates that principled context management via learned summaries is a scalable and effective approach to train RL agents beyond fixed context windows, with implications for broader long-horizon LLM applications.

Abstract

We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and, most importantly, strict context limits. To address these challenges, we introduce summarization-based context management into training. Specifically, it periodically compresses the tool-use history into LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that lets standard LLM RL infrastructures optimize both tool-use behaviors and summarization strategies end to end. We instantiate this framework with SUmmarization augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that, for complex searching tasks, SUPO can further improve evaluation performance when the maximum number of summarization rounds at test time is scaled beyond that used in training. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
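The rollout idea described in the abstract can be illustrated with a toy sketch. This is not the SUPO implementation: the function `rollout_with_summaries`, its parameters, the whitespace token counter, and the stub `act`, `tool`, and `summarize` callables are all hypothetical stand-ins, and the rule of resetting the context to the prompt plus the summary is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Context = List[str]


def n_tokens(context: Context) -> int:
    """Toy token count: whitespace-separated words stand in for real tokens."""
    return sum(len(message.split()) for message in context)


def rollout_with_summaries(
    prompt: str,
    act: Callable[[Context], str],        # hypothetical policy call
    tool: Callable[[str], str],           # hypothetical tool environment
    summarize: Callable[[Context], str],  # hypothetical summarization call
    threshold: int,
    max_turns: int,
    max_summaries: int,
) -> Tuple[List[Context], int]:
    """Return the context segments of one rollout and the peak working-context size."""
    context: Context = [prompt]
    segments: List[Context] = []
    peak = n_tokens(context)
    for _ in range(max_turns):
        action = act(context)
        context = context + [action, tool(action)]
        peak = max(peak, n_tokens(context))
        if n_tokens(context) < threshold:
            continue  # still below the summarization threshold
        if len(segments) == max_summaries:
            break  # summary budget exhausted: stop the rollout
        summary = summarize(context)
        segments.append(context + [summary])
        peak = max(peak, n_tokens(segments[-1]))
        # Assumed reset rule: the summary replaces the full history.
        context = [prompt, summary]
    segments.append(context)
    return segments, peak


# Stub components with fixed lengths, so the behavior is easy to trace by hand.
segments, peak = rollout_with_summaries(
    prompt="find the answer",
    act=lambda ctx: "call search",
    tool=lambda action: "result " * 5,
    summarize=lambda ctx: "summary of progress",
    threshold=20,
    max_turns=10,
    max_summaries=2,
)
total = sum(n_tokens(segment) for segment in segments)
print(len(segments), peak, total)
```

In this toy run the rollout spans several segments, so the total number of tokens processed exceeds the largest context the model ever sees at once. That gap is the distinction between working context length and effective context length that the figures below refer to.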

Paper Structure

This paper contains 29 sections, 2 theorems, 17 equations, 5 figures, 3 tables, 2 algorithms.

Key Result

Proposition 3.1

Under $\mathcal{M}_{\mathcal{V}}^{\mathtt{sum}}$, the working context length satisfies $|s_t|+|a_t|\leq L + 2L_{\mathcal{A}} + L_{\mathcal{O}} + |v_{\mathtt{sum}}|$. Here $L$ is the summarization threshold, $L_{\mathcal{A}}$ denotes the maximum number of new tokens generated in a single LLM call, and $L_{\mathc

Figures (5)

  • Figure 1: An illustration of the different rollout processes of $\mathcal{M}_{\mathcal{V}}$ (upper) and $\mathcal{M}_{\mathcal{V}}^{\mathtt{sum}}$ (lower).
  • Figure 2: Training curves and validation curves of SUPO (working context length 64K, effective context length 192K) and GRPO (working context length 64K). Here the score metric in the training curve at each step refers to the averaged score of all $8$ rollouts in the training batch at that step. CodeGym runs for $1$ epoch. BrowseComp-Plus runs for $5$ epochs.
  • Figure 3: Training dynamics of the summarization rate and the conditional success rate. The experiments use a working context length of 64K and an effective context length of 192K. SUPO on BrowseComp-Plus is run for $5$ epochs, while SUPO (w/o overlong masking) is run for only $3$ epochs to save computation, since its performance degenerates.
  • Figure 4: Mean # tool calling.
  • Figure 5: Test-time scaling.

Theorems & Definitions (4)

  • Proposition 3.1: Working context length under $\mathcal{M}_{\mathcal{V}}^{\mathtt{sum}}$
  • Theorem 3.2: Policy gradient representation of $\mathcal{M}_{\mathcal{V}}^{\mathtt{sum}}$
  • Proof: Proof of Theorem 3.2
  • Proof: Proof of Theorem 3.2