ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou

TL;DR

The paper tackles the bottleneck of fixed context windows in knowledge-intensive web search by introducing ReSum, a paradigm that periodically compresses conversation history into compact summaries to enable long-horizon reasoning with minimal architectural changes. It contributes ReSumTool-30B, a specialized summary model tailored for goal-oriented web search, and ReSum-GRPO, an RL adaptation that segments trajectories at summary points and broadcasts trajectory-level advantages to train agents effectively in the summary-conditioned setting. Empirical results across multiple benchmarks show consistent improvements over ReAct, with notable gains after RL adaptation, and demonstrate the approach's applicability to agents with extended context windows, achieving competitive performance with reduced training data. Overall, ReSum offers a lightweight, compatible path to extend the reasoning horizon of web agents, enabling more reliable, evidence-grounded search outcomes in complex, uncertain scenarios.

Abstract

Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing most open-source web agents.
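The periodic summarization described in the abstract can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the paper's implementation: `toy_policy`, `toy_tool`, and `toy_summarize` replace the agent, the search tool, and the summary model; tokens are approximated by whitespace-separated words; and the trigger (total history length exceeding `budget`) is an assumed rule chosen for simplicity.

```python
from typing import Callable, List, Optional, Tuple

def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())

def resum_rollout(
    question: str,
    policy: Callable[[List[str]], Tuple[str, str]],
    tool: Callable[[str], str],
    summarize: Callable[[List[str]], str],
    budget: int = 30,
    max_steps: int = 20,
) -> Tuple[Optional[str], int]:
    """Run a search loop that compresses history once it exceeds `budget`.

    Returns (final_answer_or_None, number_of_summaries_made).
    """
    history = [question]
    n_summaries = 0
    for _ in range(max_steps):
        kind, payload = policy(history)
        if kind == "answer":
            return payload, n_summaries
        # ReAct-style step: append the action and its observation.
        history.append(f"action: {payload}")
        history.append(f"observation: {tool(payload)}")
        # Summary step: replace the long history with a compact state
        # and resume reasoning from the question plus that summary.
        if sum(count_tokens(h) for h in history) > budget:
            summary = summarize(history)
            history = [question, f"summary: {summary}"]
            n_summaries += 1
    return None, n_summaries

# --- Hypothetical toy components, for illustration only ---
def toy_tool(query: str) -> str:
    # Each search returns one useful fact buried in filler text.
    return f"fact-{query} " + "filler " * 10

def toy_policy(history: List[str]) -> Tuple[str, str]:
    # Answer once three facts are visible, otherwise search again.
    known = " ".join(history).count("fact-")
    return ("answer", "done") if known >= 3 else ("search", str(known))

def toy_summarize(history: List[str]) -> str:
    # Keep only the facts and drop the filler.
    words = " ".join(history).split()
    return " ".join(w for w in words if w.startswith("fact-"))

answer, n_summaries = resum_rollout("Q", toy_policy, toy_tool, toy_summarize)
```

In this toy run the third search pushes the history past the budget, the summary keeps the three facts, and the policy answers from the compressed state. A real agent would substitute an LLM policy, real tool calls, and a learned summarizer.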

Paper Structure

This paper contains 23 sections, 6 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison between the ReAct [yao2023react] and ReSum paradigms. Appending every observation, thought, and action in ReAct exhausts the context budget before multi-turn exploration completes. In contrast, ReSum periodically invokes a summary tool to condense history and resumes reasoning from the compressed summary, enabling indefinite exploration.
  • Figure 2: Context limits in ReAct constrain exploration. Using the open-source WebSailor-7B [li2025websailor] on BrowseComp-en [bc_en], we compare the distributions of token consumption and tool call counts between correctly solved and failed trajectories. Failed cases use far more tool calls and tokens, suggesting trajectories are frequently truncated before resolution.
  • Figure 3: Illustration of ReSum-GRPO. ReSum periodically summarizes long trajectories and restarts from compressed states, resulting in segmented trajectories. A single trajectory-level reward is computed from the final answer, normalized within the group to obtain a trajectory-level advantage, and that advantage is broadcast to all segments within the same rollout.
  • Figure 4: Comparison of training dynamics between GRPO [shao2024deepseekmath] and our ReSum-GRPO. ReSum-GRPO demonstrates higher initial rewards and faster convergence than standard GRPO.
  • Figure 5: Average token consumption vs. performance across different paradigms. Token consumption refers to the total number of tokens in a complete trajectory for a query.
  • ...and 1 more figures
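
The advantage broadcasting described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function name `broadcast_advantages` is hypothetical, the normalization uses the common GRPO form (reward minus group mean, divided by group standard deviation), and the population standard deviation plus a small `eps` are choices made here for numerical safety.

```python
from statistics import mean, pstdev
from typing import List

def broadcast_advantages(
    rewards: List[float],
    segment_counts: List[int],
    eps: float = 1e-6,
) -> List[List[float]]:
    """Give every segment of a rollout that rollout's group-normalized advantage.

    rewards[i] is the single trajectory-level reward of rollout i.
    segment_counts[i] is how many segments summarization split rollout i into.
    """
    if len(rewards) != len(segment_counts):
        raise ValueError("one reward and one segment count per rollout")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    per_rollout = []
    for reward, n_segments in zip(rewards, segment_counts):
        advantage = (reward - mu) / (sigma + eps)
        # Broadcast: the same scalar is copied to each segment.
        per_rollout.append([advantage] * n_segments)
    return per_rollout

# A group of four rollouts for one query; two reached a correct answer.
group = broadcast_advantages(
    rewards=[1.0, 0.0, 0.0, 1.0],
    segment_counts=[3, 1, 2, 1],
)
```

Here the first rollout was summarized twice, so it has three segments, and all three receive the same positive advantage from its final answer. If every rollout in the group gets the same reward, all advantages are zero and the group contributes no learning signal.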