Scaling Long-Horizon LLM Agent via Context-Folding

Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, Jiecao Chen

TL;DR

Long-horizon LLM agents are constrained by context length. Context Folding enables active context management by branching to sub-tasks and folding their intermediate steps, while FoldGRPO learns this behavior with token-level process rewards. On BrowseComp-Plus and SWE-Bench Verified, folding with a 32K active context and up to 10 branches matches or surpasses baselines that use much larger contexts and yields substantial efficiency gains. This work demonstrates that learnable context management is a principled and scalable pathway toward stronger, autonomous long-horizon LLM agents.

Abstract

Large language model (LLM) agents are fundamentally constrained by context length on long-horizon tasks. We introduce Context-Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we develop an end-to-end reinforcement learning framework, FoldGRPO, with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks (Deep Research and SWE), our folding agent matches or outperforms the ReAct baselines while using an active context 10$\times$ smaller, and it significantly outperforms models that rely on summarization-based context management.
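The branch-and-fold mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `FoldingContext`, its method names, the `[branch]`/`[return]` markers, and the single-level (non-nested) branching rule are all assumptions made for illustration. It shows only the bookkeeping idea, namely that intermediate steps of a sub-trajectory are dropped from the working context and replaced by a short summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FoldingContext:
    """Toy working context with branch/fold (hypothetical names, not the paper's API)."""

    messages: List[str] = field(default_factory=list)
    _branch_start: Optional[int] = None  # index of the first message inside the branch

    def append(self, message: str) -> None:
        """Add one step (thought, tool call, or observation) to the context."""
        self.messages.append(message)

    def branch(self, description: str) -> None:
        """Open a sub-trajectory for a subtask. Assumption: no nested branches."""
        if self._branch_start is not None:
            raise RuntimeError("already inside a branch")
        self.messages.append(f"[branch] {description}")
        self._branch_start = len(self.messages)

    def fold(self, summary: str) -> None:
        """Collapse the branch: drop its intermediate steps, keep only a summary."""
        if self._branch_start is None:
            raise RuntimeError("no active branch to fold")
        del self.messages[self._branch_start:]
        self.messages.append(f"[return] {summary}")
        self._branch_start = None


# Placeholder task content, invented purely to exercise the sketch.
ctx = FoldingContext()
ctx.append("task: answer the user's question")
ctx.branch("gather evidence for sub-question A")
ctx.append("tool call: search(...)")
ctx.append("observation: a long page of results")
ctx.fold("sub-question A resolved; key finding recorded")

for line in ctx.messages:
    print(line)
```

After the fold, the working context holds three entries (the task, the branch marker, and the returned summary), while the two intermediate steps are gone. In the paper this behavior is learned with FoldGRPO rather than hard-coded; the sketch covers only the context bookkeeping.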

Paper Structure

This paper contains 35 sections, 6 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Examples of context folding in long-horizon tasks: deep research (left) and agentic coding (right).
  • Figure 2: (a) Context Folding: a mechanism that enables the agent to actively manage its context through branching and return. (b) FoldGRPO: end-to-end optimization of context folding agent.
  • Figure 3: Agent performance across data difficulty groups. RL training yields consistent performance gains across easy, medium, and hard instances.
  • Figure 4: With RL training, we observe an increase in the number of tool calls, branching behavior, total number of tokens, and the number of searched pages.
  • Figure 5: Left: Pass@1 vs. agent max context length. Right: Pass@1 vs. number of combined questions. Multiple easy questions are combined into a single harder question to increase problem complexity; a higher number of combined questions indicates more required actions and a longer context to answer them correctly. See the section on combined questions for details.
  • ...and 5 more figures