COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context
Guangya Wan, Mingyang Ling, Xiaoqi Ren, Rujun Han, Sheng Li, Zizhao Zhang
TL;DR
COMPASS targets context management, the main bottleneck in long-horizon LLM agent tasks, with a hierarchical three-agent framework that separates tactical execution, strategic oversight, and context organization. The Main Agent handles reasoning and tool use, the Meta-Thinker provides asynchronous strategic interventions, and the Context Manager compresses histories into focused briefs that guide progress. Across GAIA, BrowseComp, and Humanity's Last Exam (HLE), COMPASS achieves substantial accuracy gains (up to ~20% relative) and improves strategic reliability as measured by four meta-thinking metrics. Extensions such as Context-12B and COMPASS-TTS further improve efficiency and scalability, and under parallel sampling bring performance on par with established DeepResearch agents. Together, these contributions show a practical path to robust, scalable autonomous reasoning in long-horizon settings; broader-domain evaluation and open-model integration are left to future work.
Abstract
Long-horizon tasks that require sustained reasoning and multiple tool interactions remain challenging for LLM agents: small errors compound across steps, and even state-of-the-art models often hallucinate or lose coherence. We identify context management as the central bottleneck: extended histories cause agents to overlook critical evidence or become distracted by irrelevant information, so they fail to replan or reflect on previous mistakes. To address this, we propose COMPASS (Context-Organized Multi-Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components: (1) a Main Agent that performs reasoning and tool use, (2) a Meta-Thinker that monitors progress and issues strategic interventions, and (3) a Context Manager that maintains concise, relevant progress briefs for different reasoning stages. Across three challenging benchmarks (GAIA, BrowseComp, and Humanity's Last Exam), COMPASS improves accuracy by up to 20% relative to both single- and multi-agent baselines. We further introduce a test-time scaling extension that elevates performance to match established DeepResearch agents, and a post-training pipeline that delegates context management to smaller models for enhanced efficiency.
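The division of labor among the three components can be illustrated with a minimal sketch. This is not the authors' implementation: every class, function, and heuristic below (`Brief`, `ContextManager`, `MetaThinker`, `run_episode`, `toy_agent`, the "repeat means stuck" rule, the `ANSWER:` prefix) is a hypothetical stand-in, the LLM calls are replaced by toy logic, and the Meta-Thinker runs synchronously after each step here purely for simplicity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Brief:
    """Concise progress summary handed to the Main Agent (hypothetical structure)."""
    task: str
    key_facts: List[str] = field(default_factory=list)
    guidance: Optional[str] = None


class ContextManager:
    """Compresses the raw history into a short brief (placeholder heuristic)."""

    def __init__(self, max_facts: int = 3):
        self.max_facts = max_facts

    def make_brief(self, task: str, history: List[str], guidance: Optional[str]) -> Brief:
        # Assumption: keep only the most recent observations. A real system
        # would use a model to select the evidence relevant to the current stage.
        return Brief(task=task, key_facts=history[-self.max_facts:], guidance=guidance)


class MetaThinker:
    """Monitors progress and occasionally issues a strategic intervention."""

    def review(self, history: List[str]) -> Optional[str]:
        # Assumption: two identical consecutive observations signal being stuck.
        if len(history) >= 2 and history[-1] == history[-2]:
            return "replan"
        return None


def run_episode(task: str,
                main_agent_step: Callable[[Brief], str],
                max_steps: int = 5) -> Optional[str]:
    """Run the loop: brief -> Main Agent step -> Meta-Thinker review."""
    history: List[str] = []
    manager, thinker = ContextManager(), MetaThinker()
    guidance: Optional[str] = None
    for _ in range(max_steps):
        # The Main Agent sees only the brief, never the full history.
        brief = manager.make_brief(task, history, guidance)
        output = main_agent_step(brief)
        if output.startswith("ANSWER:"):
            return output[len("ANSWER:"):].strip()
        history.append(output)
        guidance = thinker.review(history)
    return None  # step budget exhausted without an answer


def toy_agent(brief: Brief) -> str:
    # Stand-in for an LLM call: repeats a failing search until told to replan.
    if brief.guidance == "replan":
        return "ANSWER: 42"
    return "search returned nothing"


if __name__ == "__main__":
    print(run_episode("toy question", toy_agent))
```

In this toy run the agent repeats the same unproductive step, the Meta-Thinker flags the repetition, and the next brief carries the intervention back to the agent. The point of the design is that the Main Agent conditions on a short brief rather than on the full, growing history.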
