Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window
Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Yaojie Lu, Xianpei Han, Le Sun, WenJuan Zhang, Pengbo Wang, Shixuan Liu, Zhenru Zhang, Jianhong Tu, Hongyu Lin, Junyang Lin
TL;DR
This paper tackles the difficulty of enabling deep reasoning in long-horizon, multi-turn search agents. It introduces DeepMiner, a framework that combines (1) a reverse-constructed, verifiable QA data pipeline to raise task complexity, and (2) a dynamic sliding-window context management strategy that preserves reasoning traces while compressing older tool outputs, avoiding external summarization models. The approach is instantiated on Qwen3-32B and trained with supervised fine-tuning followed by reinforcement learning using Group Relative Policy Optimization, achieving substantial gains across BrowseComp, XBench-DeepSearch, and GAIA, including 33.5% accuracy on BrowseComp-en and the ability to sustain nearly 100 turns within a 32k context. This work demonstrates that high-quality, cross-document training signals and efficient, dynamic context handling can significantly extend the practical horizon of open-source web agents, narrowing the gap to proprietary systems and enabling more reliable long-horizon reasoning in real-world tasks.
Abstract
While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and a dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of the training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, which uses a sliding-window mechanism and eliminates the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within a standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.
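The sliding-window idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Message` type, the `apply_sliding_window` function, the `keep_last` parameter, and the placeholder string are all hypothetical names chosen for illustration, and the abstract does not specify the actual window size or the rule for when the window slides. The sketch only shows the general pattern of keeping every reasoning turn intact while replacing older tool outputs with a short placeholder, without calling any summarization model.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder text; the wording used by the paper is not given here.
PLACEHOLDER = "[earlier tool output omitted]"


@dataclass
class Message:
    role: str      # "user", "assistant" (reasoning and tool calls), or "tool"
    content: str


def apply_sliding_window(history: List[Message], keep_last: int) -> List[Message]:
    """Return a copy of `history` in which only the `keep_last` most recent
    tool outputs keep their full text. Older tool outputs are replaced by a
    placeholder; user and assistant messages are never modified.
    """
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")

    # Positions of all tool outputs, oldest first.
    tool_indices = [i for i, m in enumerate(history) if m.role == "tool"]

    # Everything before the last `keep_last` tool outputs is considered stale.
    n_stale = max(len(tool_indices) - keep_last, 0)
    stale = set(tool_indices[:n_stale])

    return [
        Message(m.role, PLACEHOLDER) if i in stale else m
        for i, m in enumerate(history)
    ]


# Toy three-turn trajectory (contents are made up for illustration).
history = [
    Message("user", "Who founded the company described in the clues?"),
    Message("assistant", "I should search for the first clue."),
    Message("tool", "long search result 1 ..."),
    Message("assistant", "That narrows it down; open the second page."),
    Message("tool", "long page content 2 ..."),
    Message("assistant", "Verify the founder name on a third source."),
    Message("tool", "long page content 3 ..."),
]

compressed = apply_sliding_window(history, keep_last=1)
```

In this sketch the message count stays constant and only the text of stale tool outputs shrinks, so the reasoning chain remains readable to the model while the token budget is spent mostly on recent observations. Under these assumptions, the same function could be applied both when building training sequences and at inference time, which is the consistency the abstract emphasizes.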
