TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

Yongchao Chen; Jiefeng Chen; Rui Meng; Ji Yin; Na Li; Chuchu Fan; Chi Wang; Tomas Pfister; Jinsung Yoon

TL;DR

This work introduces Tool-Use Mixture (TUMIX), a framework that uses diverse tool-enabled agents for test-time scaling of LLM reasoning. TUMIX runs multiple agents in parallel, each with a distinct tool-use strategy, and has them refine their answers over several rounds. It achieves consistent accuracy gains on challenging reasoning benchmarks, and adaptive termination guided by an LLM judge preserves performance at roughly half the inference cost. The study finds that agent diversity and quality matter more than the sheer number of agents, and that LLM-generated agents can complement human-designed ones to enlarge the effective agent pool. Scaling further (TUMIX+) yields additional improvements at higher cost. Overall, the work provides practical, reproducible guidance for integrating Code Interpreter and Search tools into multi-agent reasoning and offers open-source resources for tool-augmented LLMs.

Abstract

While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
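
The loop described in the abstract (several agents answer, each sees the previous round's answers, and refinement stops once a judge is satisfied) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Agent` and `Judge` interfaces, the `tumix_sketch` function, the toy agents, and the use of majority voting to pick the final answer are all assumptions made for illustration. Real agents would call an LLM with Code Interpreter or Search and would run in parallel; here they are plain functions called one after another.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the paper):
# an agent maps (question, previous-round answers) to a new answer,
# a judge decides from the current answers whether refinement can stop.
Agent = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str]], bool]


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer (an assumed aggregation rule)."""
    return Counter(answers).most_common(1)[0][0]


def tumix_sketch(
    question: str,
    agents: List[Agent],
    judge: Judge,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run rounds of answer sharing and refinement with early stopping."""
    previous: List[str] = []
    rounds_used = 0
    for round_idx in range(1, max_rounds + 1):
        # Every agent sees the question and all answers from the last round.
        previous = [agent(question, previous) for agent in agents]
        rounds_used = round_idx
        # Stop early once the judge considers the answers good enough.
        if judge(question, previous):
            break
    return majority_vote(previous), rounds_used


# Toy stand-ins for agents with different tool-use strategies.
def text_agent(question: str, prev: List[str]) -> str:
    return majority_vote(prev) if prev else "41"  # wrong at first, then adopts consensus


def code_agent(question: str, prev: List[str]) -> str:
    return "42"  # pretends to compute the answer with code


def search_agent(question: str, prev: List[str]) -> str:
    return majority_vote(prev) if prev else "42"


def agreement_judge(question: str, answers: List[str]) -> bool:
    return len(set(answers)) == 1  # stop when all agents agree


answer, rounds = tumix_sketch(
    "toy question", [text_agent, code_agent, search_agent], agreement_judge
)
print(answer, rounds)
```

In this toy run the agents disagree in the first round and converge in the second, so the judge stops refinement before `max_rounds` is reached. That is the mechanism behind the cost saving the abstract reports, although the 49% figure itself comes from the paper's experiments, not from this sketch.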
