AgentFold: Long-Horizon Web Agents with Proactive Context Management

Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang

TL;DR

AgentFold tackles long-horizon web information seeking by treating the agent's context as a cognitive workspace that is managed proactively. Its context consists of multi-scale state summaries plus the latest interaction, and a folding operation either condenses a single step or consolidates several steps of history, preserving crucial details while pruning noise. Trained via Fold-Generator and supervised fine-tuning on open-source LLMs, AgentFold-30B-A3B achieves state-of-the-art results among open-source models and performance competitive with proprietary agents, while keeping its context compact (≈7k tokens after 100 turns) and supporting hundreds of interaction steps. RL-based optimization of the folding policy is proposed as a future direction.

Abstract

LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that summarize the full history at every step risk the irreversible loss of critical details. To address these limitations, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a "folding" operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.
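The folding operation described above, together with the two modes named in the Figure 2 caption, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Summary`, `Context`, `record`, and `fold` are hypothetical, and the summary text that the trained model would generate is passed in here as a plain string. The sketch only shows how a single fold range can cover either one step (granular condensation) or several steps (deep consolidation).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Summary:
    """One folded block covering steps start..end (inclusive)."""
    start: int
    end: int
    text: str


@dataclass
class Context:
    """Hypothetical workspace: folded summaries plus the full latest step."""
    summaries: List[Summary] = field(default_factory=list)
    latest_step: int = 0
    latest_interaction: str = ""


def record(ctx: Context, interaction: str) -> None:
    """Store a new step in full as the latest interaction."""
    if ctx.latest_interaction:
        raise ValueError("fold the previous interaction before recording a new one")
    ctx.latest_step += 1
    ctx.latest_interaction = interaction


def fold(ctx: Context, start: int, summary_text: str) -> None:
    """Replace every block from `start` through the latest step with one summary.

    start == ctx.latest_step -> granular condensation (one step is folded)
    start <  ctx.latest_step -> deep consolidation (several steps are merged)
    """
    if not 1 <= start <= ctx.latest_step:
        raise ValueError("fold range must begin at an existing step")
    if any(s.start < start <= s.end for s in ctx.summaries):
        raise ValueError("fold range must begin at a block boundary")
    kept = [s for s in ctx.summaries if s.end < start]
    ctx.summaries = kept + [Summary(start, ctx.latest_step, summary_text)]
    ctx.latest_interaction = ""


ctx = Context()
record(ctx, "search: query A -> no relevant hits")
fold(ctx, 1, "Step 1: query A found nothing relevant")   # granular condensation
record(ctx, "search: query B -> no relevant hits")
fold(ctx, 2, "Step 2: query B found nothing relevant")   # granular condensation
record(ctx, "search: query C -> no relevant hits")
fold(ctx, 1, "Steps 1-3: this search direction is a dead end")  # deep consolidation
print(ctx.summaries)
```

After the last call, the three steps are represented by a single block, which is the mechanism the abstract credits for keeping the context short over long trajectories.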

Paper Structure

This paper contains 13 sections, 3 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Our AgentFold-30B-A3B agent demonstrates remarkable performance on challenging long-horizon benchmarks, matching or surpassing agents with significantly larger model sizes. This is enabled by its proactive context folding, which maintains a highly concise and focused context that reaches only 7k tokens after 100 turns of interaction and is capable of scaling to 500 turns.
  • Figure 2: Overview of AgentFold at an intermediate step. The two key parts of AgentFold's context are Multi-Scale State Summaries (several folded blocks recording previous information) and Latest Interaction (a full record of the latest step). AgentFold responds with four blocks: thinking, folding, explanation, and tool call (which leads to an appended tool response). The folding directive has two operation modes: granular condensation, which folds a single step while preserving its useful information, and deep consolidation, which folds several steps into a coarse summary, especially when these steps complete a sub-task and the intermediate details are not critical for further task-solving.
  • Figure 3: Analysis of AgentFold's context on trajectories sampled from BrowseComp. (a) AgentFold's context length grows at a remarkably slow, sub-linear rate, less than doubling from approximately 3.5k to 7k tokens over 100 turns. As our model's max context is 128k, this indicates promising potential for AgentFold to tackle complex, long-horizon tasks. (b) The deep consolidation operation in AgentFold merges multiple past steps into a single summary, thereby maintaining a significantly more structured and concise context than the popular ReAct.
  • Figure 4: Scaling properties of interaction turns (tool calls). This demonstrates the potential of AgentFold to work robustly for hundreds of steps on behalf of humans.
  • Figure 5: Case study illustrating AgentFold. See the detailed content in the table labeled tab:case1_traj and the figures labeled fig:case1_context and fig:case1_response. After a series of failed attempts (steps 6 to 16), AgentFold notices that this direction might be a dead end, folds these intermediate steps into one conclusion, plans to switch to other search directions, and decides on the new search queries.
  • ...and 4 more figures
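The context-length figures quoted in the Figure 3 caption can be put in perspective with a short calculation. This is only back-of-the-envelope arithmetic on the approximate numbers in the caption; it assumes "128k" means 128,000 tokens, and the average it computes is not a growth model, since the caption describes the growth as sub-linear.

```python
# Approximate values taken from the Figure 3 caption.
start_tokens = 3_500     # context length near the start of a trajectory
end_tokens = 7_000       # context length after 100 turns
turns = 100
max_context = 128_000    # assumption: "128k" read as 128,000 tokens

# Average growth per turn over the first 100 turns.
avg_growth_per_turn = (end_tokens - start_tokens) / turns

# Share of the context window occupied after 100 turns.
used_fraction = end_tokens / max_context

print(f"average growth: {avg_growth_per_turn:.0f} tokens per turn")
print(f"window used after {turns} turns: {used_fraction:.1%}")
```

Under these assumptions the context grows by about 35 tokens per turn on average and fills roughly 5.5% of the window after 100 turns.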