Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao

TL;DR

The paper tackles the limitations of single-model test-time scaling (TTS) and static, homogeneous MAS by introducing Team-of-Thoughts (ToT), a heterogeneous MAS that uses an orchestrator to call specialized tool agents. It formalizes orchestration calibration and a self-assessment mechanism to profile agent strengths, enabling dynamic, task-aware activation and budget allocation. Across reasoning and code-generation benchmarks, ToT achieves state-of-the-art accuracies (e.g., $96.67\%$ on AIME24 and $72.53\%$ on LiveCodeBench) while delivering superior accuracy-to-token efficiency compared to homogeneous baselines and prior MAS approaches. This work demonstrates that coordinating diverse post-trained models via tool calling can realize significant test-time scaling with practical impact for high-stakes reasoning tasks.

Abstract

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.
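To make the inference-time step concrete, the following is a minimal sketch of how an orchestrator might pick tool agents from self-assessed proficiency profiles. It is not the authors' implementation: the `ToolAgent` class, the `select_agents` function, the `top_k` and `min_score` parameters, the agent names, and all scores are hypothetical, and the abstract does not specify how profiles are represented or thresholded.

```python
from dataclasses import dataclass, field


@dataclass
class ToolAgent:
    """Hypothetical tool agent with a self-assessed proficiency profile."""
    name: str
    # Maps a question type to a self-assessed score in [0, 1] (assumed scale).
    profile: dict[str, float] = field(default_factory=dict)


def select_agents(agents, question_type, top_k=2, min_score=0.5):
    """Return names of up to top_k agents whose score meets min_score.

    Agents are ranked by descending self-assessed score for the question
    type; ties are broken alphabetically by name so the result is stable.
    """
    scored = [(a.profile.get(question_type, 0.0), a.name) for a in agents]
    eligible = [(score, name) for score, name in scored if score >= min_score]
    eligible.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in eligible[:top_k]]


# Illustrative profiles only; these numbers are made up, not from the paper.
agents = [
    ToolAgent("agent_a", {"math": 0.9, "code": 0.4}),
    ToolAgent("agent_b", {"math": 0.6, "code": 0.8}),
    ToolAgent("agent_c", {"math": 0.3, "code": 0.7}),
]

print(select_agents(agents, "math"))  # ['agent_a', 'agent_b']
print(select_agents(agents, "code"))  # ['agent_b', 'agent_c']
```

Skipping agents below the threshold is one simple way to avoid spending tokens on agents that rate themselves as weak for a given question type, which is the kind of selective activation the abstract describes.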

Paper Structure (20 sections, 8 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overview of Team-of-Thoughts. (a) While standard reasoning methods rely on a single model (Token-decomposed Thoughts) or homogeneous multi-agent groups (Role-diverse Thoughts), Team-of-Thoughts incorporates heterogeneous models to ensure broad coverage of the solution space. (b) Our framework integrates an orchestrator for tool-agent management, utilizing an initialization pipeline that includes orchestrator calibration and agent self-profiling. At inference time, the orchestrator identifies the optimal tools for the input query and synthesizes their reasoning trajectories into a high-confidence final response.
  • Figure 2: Schematic comparison of language modeling methods. (Top left) Standard Inference: A single model predicts target $X$ directly from input $D$. (Top middle) Agentic Reasoning: Methods like CoT generate intermediate steps to refine the prediction distribution. (Bottom left) Consensus-based MAS: Multiple agents reason iteratively until consensus is reached. (Right) Team-of-Thoughts MAS: An orchestrator leverages heterogeneous tool agents. During calibration, agents self-assess their proficiency on question types $T$. During inference, the orchestrator selectively invokes agents based on these assessments, aligning the prediction with the target while maintaining token efficiency.
  • Figure 3: Comparison of different tool-agent selection methods on AIME2024 and MBPP+, with outputs aggregated by GPT-5-mini as the orchestrator. Dashed lines indicate baseline GPT-5-mini performance.