IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction

Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

TL;DR

IterResearch tackles long-horizon reasoning by replacing an ever-growing context with a Markovian workspace that is periodically synthesized into a compact, evolving report. It introduces Efficiency-Aware Policy Optimization (EAPO), which uses geometric reward discounting and adaptive downsampling to train agents under this iterative paradigm. Across six benchmarks, IterResearch improves on open-source baselines by an average of 14.5 percentage points and narrows the gap with frontier proprietary systems, and its interaction scaling extends to 2048 rounds. IterResearch also works as a model-agnostic prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks, so it is useful both as a trained agent and as a prompting paradigm.

Abstract

Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents, averaging +14.5pp across six benchmarks, and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
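
To make "geometric reward discounting" concrete, here is a minimal Python sketch. The abstract does not state the exact formula, so the form used here is an assumption: a single terminal reward is propagated back to round t as gamma^(T - t) times that reward, where T is the number of rounds in the trajectory. The function name and the default value of `gamma` are hypothetical.

```python
from typing import List


def discounted_round_rewards(
    final_reward: float, num_rounds: int, gamma: float = 0.995
) -> List[float]:
    """Spread one terminal reward over all rounds of a trajectory.

    ASSUMPTION (not stated in the abstract): round t, 1-indexed, receives
    final_reward * gamma ** (num_rounds - t), so the last round gets the
    full reward and earlier rounds get geometrically less.
    """
    if num_rounds < 1:
        raise ValueError("num_rounds must be at least 1")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    return [
        final_reward * gamma ** (num_rounds - t)
        for t in range(1, num_rounds + 1)
    ]


# A correct answer reached in 3 rounds versus 10 rounds, same terminal reward.
short = discounted_round_rewards(1.0, num_rounds=3, gamma=0.5)
long = discounted_round_rewards(1.0, num_rounds=10, gamma=0.5)
```

Under this assumed form, the first round of the shorter trajectory receives a larger reward than the first round of the longer one, which is one way a discount can favor efficient exploration.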

Paper Structure

This paper contains 40 sections, 8 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Performance of IterResearch against state-of-the-art open-source long-horizon agents.
  • Figure 2: (Top) The mono-contextual approach linearly accumulates all information into a single, ever-expanding context, leading to context suffocation and noise contamination. (Bottom) IterResearch models deep research as an extended MDP with workspace reconstruction. Each round begins with a reconstructed workspace $s_t$ containing the question, an evolving report $\mathcal{M}_t$, and immediate context. The agent generates structured decisions $d_t=$ (Think, Report, Action) and interacts with environment $\mathcal{E}$. The transition function $\mathcal{T}$ reconstructs the workspace, maintaining the Markov property while preventing context bloat and enabling sustained reasoning and information-seeking.
  • Figure 3: Interaction Scaling.
  • Figure 4: Performance comparison between IterResearch and ReAct as Prompting Strategies.
  • Figure 5: Training dynamics of our RL. (Left) Training Rewards Curve. (Right) Accuracy Curve.
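
The round structure described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class names, the `is_final` flag, the `max_rounds` cap, and the toy policy and environment are all hypothetical. The property it illustrates is that the next workspace is rebuilt from the question, the updated report, and only the latest interaction, so its size does not grow with the number of rounds.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Workspace:
    """State s_t: question, evolving report M_t, and immediate context."""
    question: str
    report: str
    last_action: Optional[str] = None
    last_observation: Optional[str] = None


@dataclass(frozen=True)
class Decision:
    """Structured decision d_t = (Think, Report, Action)."""
    think: str
    report: str
    action: str
    is_final: bool = False  # hypothetical flag: the action is the final answer


Policy = Callable[[Workspace], Decision]
Environment = Callable[[str], str]


def reconstruct(state: Workspace, decision: Decision, observation: str) -> Workspace:
    """Transition T: keep the question, the updated report, and the latest
    action/observation pair. Earlier thoughts and observations are dropped."""
    return Workspace(state.question, decision.report, decision.action, observation)


def run(question: str, policy: Policy, env: Environment,
        max_rounds: int = 8) -> Tuple[str, int]:
    """Iterate until the policy emits a final answer or the round cap is hit."""
    state = Workspace(question=question, report="")
    for round_idx in range(1, max_rounds + 1):
        decision = policy(state)
        if decision.is_final:
            return decision.action, round_idx
        observation = env(decision.action)
        state = reconstruct(state, decision, observation)
    return state.report, max_rounds  # fall back to the report if no answer


# Toy stand-ins for the model and the tool environment (hypothetical).
def toy_policy(state: Workspace) -> Decision:
    if state.last_observation is None:
        return Decision("Need a fact.", state.report, "search: capital of France")
    return Decision("Enough evidence.", f"Found: {state.last_observation}",
                    state.last_observation, is_final=True)


def toy_env(action: str) -> str:
    return "Paris"


answer, rounds_used = run("What is the capital of France?", toy_policy, toy_env)
```

In a mono-contextual agent the equivalent of `state` would be the full transcript of every thought, action, and observation; here only the report carries information across rounds.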