Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao
TL;DR
The paper addresses the limitations of single-model test-time scaling (TTS) and of static, homogeneous multi-agent systems (MAS) by introducing Team-of-Thoughts (ToT), a heterogeneous MAS in which an orchestrator calls specialized tool agents. It formalizes orchestrator calibration and a self-assessment mechanism that profiles each agent's strengths, enabling dynamic, task-aware agent activation and budget allocation. Across five reasoning and code-generation benchmarks, ToT achieves state-of-the-art accuracies (e.g., $96.67\%$ on AIME24 and $72.53\%$ on LiveCodeBench) while delivering better accuracy-to-token efficiency than homogeneous baselines and prior MAS approaches. The work demonstrates that coordinating diverse post-trained models via tool calling can yield significant test-time scaling gains, with practical impact for high-stakes reasoning tasks.
Abstract
Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.
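To make the inference-time step concrete, the following is a minimal sketch of how an orchestrator might pick tool agents from proficiency profiles. It is not the authors' implementation: the `ToolAgent` class, the `select_agents` function, the domain labels, the scores, and the simple top-k rule under a fixed agent budget are all assumptions introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToolAgent:
    """Hypothetical tool agent carrying a self-assessed proficiency profile."""
    name: str
    # Maps a domain label (e.g. "math", "code") to a score in [0, 1].
    # The labels and the score scale are illustrative assumptions.
    proficiency: dict[str, float] = field(default_factory=dict)


def select_agents(agents: list[ToolAgent], domain: str, budget: int) -> list[str]:
    """Return the names of up to `budget` agents, ranked best-first for `domain`.

    Agents with no score for the domain are treated as having proficiency 0.0.
    """
    if budget <= 0:
        return []
    ranked = sorted(
        agents,
        key=lambda a: a.proficiency.get(domain, 0.0),
        reverse=True,
    )
    return [a.name for a in ranked[:budget]]


# Illustrative team; names and scores are made up for the example.
team = [
    ToolAgent("agent_a", {"math": 0.9, "code": 0.4}),
    ToolAgent("agent_b", {"math": 0.6, "code": 0.8}),
    ToolAgent("agent_c", {"math": 0.3, "code": 0.7}),
]

math_team = select_agents(team, "math", budget=2)
code_team = select_agents(team, "code", budget=1)
```

Here a math task with a budget of two agents would activate `agent_a` and `agent_b`, while a code task with a budget of one would activate only `agent_b`. A real orchestrator would presumably make this choice through tool calls conditioned on the task itself rather than a fixed ranking rule.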
