COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context

Guangya Wan, Mingyang Ling, Xiaoqi Ren, Rujun Han, Sheng Li, Zizhao Zhang

TL;DR

COMPASS targets the bottleneck of context management in long-horizon LLM tasks by introducing a hierarchical, three-agent framework that separates tactical execution, strategic oversight, and context organization. The Main Agent handles reasoning and tool use, the Meta-Thinker provides asynchronous strategic interventions, and the Context Manager compresses histories into focused briefs to guide progress. Across GAIA, BrowseComp, and Humanity's Last Exam (HLE), COMPASS achieves accuracy gains of up to ~20% relative to single- and multi-agent baselines and shows improved strategic reliability as measured by four meta-thinking metrics. Extensions such as Context-12B and COMPASS-TTS further boost efficiency and scalability, enabling performance on par with established DeepResearch agents under parallel sampling. Together, these contributions demonstrate a practical pathway to robust, scalable autonomous reasoning in long-horizon settings, with clear directions for broader-domain evaluation and open-model integration in future work.

Abstract

Long-horizon tasks that require sustained reasoning and multiple tool interactions remain challenging for LLM agents: small errors compound across steps, and even state-of-the-art models often hallucinate or lose coherence. We identify context management as the central bottleneck -- extended histories cause agents to overlook critical evidence or become distracted by irrelevant information, so they fail to replan or to reflect on previous mistakes. To address this, we propose COMPASS (Context-Organized Multi-Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components: (1) a Main Agent that performs reasoning and tool use, (2) a Meta-Thinker that monitors progress and issues strategic interventions, and (3) a Context Manager that maintains concise, relevant progress briefs for different reasoning stages. Across three challenging benchmarks -- GAIA, BrowseComp, and Humanity's Last Exam -- COMPASS improves accuracy by up to 20% relative to both single- and multi-agent baselines. We further introduce a test-time scaling extension that elevates performance to match established DeepResearch agents, and a post-training pipeline that delegates context management to smaller models for enhanced efficiency.
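
The division of labor among the three components can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Notes`, `context_manager`, `meta_thinker`, `run`, `toy_agent`) is hypothetical, the LLM calls are replaced by deterministic stubs, the stuck-detection heuristic is an assumption, and the Meta-Thinker runs synchronously after each step for simplicity, whereas the paper describes it as asynchronous.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Notes:
    """Full history of observations (hypothetical structure)."""
    entries: List[str] = field(default_factory=list)


def context_manager(task: str, notes: Notes, guidance: str, max_items: int = 3) -> str:
    """Compress the full history into a short brief.

    Stub: keeps only the most recent notes; a real system would summarize.
    """
    recent = notes.entries[-max_items:]
    return f"TASK: {task}\nGUIDANCE: {guidance}\nRECENT: " + " | ".join(recent)


def meta_thinker(notes: Notes) -> str:
    """Return a strategic decision: 'conclude', 'pivot', or 'continue'.

    Stub heuristics (assumptions): conclude once an answer is found,
    pivot when the last two observations are identical (agent looks stuck).
    """
    if notes.entries and notes.entries[-1].startswith("FOUND"):
        return "conclude"
    if len(notes.entries) >= 2 and notes.entries[-1] == notes.entries[-2]:
        return "pivot"
    return "continue"


def run(task: str, main_agent: Callable[[str], str], max_steps: int = 10) -> Tuple[Notes, str]:
    """Alternate tactical steps (Main Agent) with strategic checks (Meta-Thinker)."""
    notes = Notes()
    guidance = "start"
    decision = "continue"
    for _ in range(max_steps):
        brief = context_manager(task, notes, guidance)  # refreshed context
        notes.entries.append(main_agent(brief))         # tactical step
        decision = meta_thinker(notes)                  # strategic oversight
        if decision == "conclude":
            break
        guidance = decision                             # fed into the next brief
    return notes, decision


def toy_agent(brief: str) -> str:
    """Stand-in for the Main Agent: succeeds only after being told to pivot."""
    if "GUIDANCE: pivot" in brief:
        return "FOUND: 42"
    return "no result"


if __name__ == "__main__":
    notes, decision = run("find the number", toy_agent)
    print(notes.entries, decision)
```

In this toy run the agent repeats a failing action twice, the stub Meta-Thinker flags the repetition and issues a pivot, and the next brief carries that guidance to the agent. The point of the sketch is only that the Main Agent sees a short brief rather than the full history.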

Paper Structure

This paper contains 41 sections, 6 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Motivation for COMPASS. ReAct-style single-agent systems (SAS) accumulate full dialogue histories, leading to context exhaustion and performance degradation. Multi-agent systems (MAS) with a human in the loop improve performance through human feedback but require human effort and lack scalability. COMPASS introduces automated monitoring and context-management agents that track reasoning, organize structured context, and maintain reliability with full automation.
  • Figure 2: The COMPASS dual-loop framework. The Main Agent performs tool interactions by following instructions from a continually refreshed context; the Meta-Thinker asynchronously monitors trajectories and triggers strategic decisions; and the Context Manager compresses full histories (from structured notes) into concise, context-aware briefs that are passed back to the Main Agent.
  • Figure 3: Strategic meta-decision metrics across agent variants. Bars report scores for four metrics that measure strategic reasoning: Persist (PAR), Pivot (PVR), Conclude (CA), and Continue (ERCR); see Table \ref{tab:meta-thinking} for examples and formal definitions.
  • Figure 4: Context-12B performance on BrowseComp. Pass@1 (%), strategy adequacy (×100), and token efficiency all improve progressively. DPO yields substantial efficiency gains without sacrificing accuracy.
  • Figure 5: Performance (Pass@1) vs. token cost for three COMPASS-TTS sampling methods on the BrowseComp benchmark. Increasing the number of parallel samples improves accuracy but also raises token costs.
  • ...and 1 more figure
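
The accuracy-versus-cost trade-off described in the Figure 5 caption can be sketched in a few lines of Python. The captions do not specify the three COMPASS-TTS sampling methods, so majority voting is used here purely as an assumed stand-in; `parallel_vote`, `toy_sample`, the canned answers, and the per-sample token count are all hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple


def parallel_vote(sample_fn: Callable[[int], Tuple[str, int]], n: int) -> Tuple[str, int]:
    """Run n independent samples; return (majority answer, total tokens spent).

    Majority voting is an assumption, not necessarily a COMPASS-TTS method.
    """
    results = [sample_fn(i) for i in range(n)]
    answers = [answer for answer, _ in results]
    total_tokens = sum(tokens for _, tokens in results)
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, total_tokens


def toy_sample(i: int) -> Tuple[str, int]:
    """Stand-in for one agent run: a canned answer plus a fixed token cost."""
    canned = ["A", "B", "B", "A", "B"]  # hypothetical answers; "B" is the majority
    return canned[i % len(canned)], 100


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, parallel_vote(toy_sample, n))
```

With these canned answers, a single sample returns the minority answer, while three or five samples agree on the majority answer at three or five times the token cost. That is the general shape of the trade-off the caption describes.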