TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

Yongchao Chen, Jiefeng Chen, Rui Meng, Ji Yin, Na Li, Chuchu Fan, Chi Wang, Tomas Pfister, Jinsung Yoon

TL;DR

This work introduces Tool-Use Mixture (TUMIX), a framework that harnesses diverse tool-enabled agents for test-time scaling of LLM reasoning. By running multiple agents in parallel with distinct tool-use strategies and refining outputs across rounds, TUMIX achieves consistent accuracy gains on challenging benchmarks while significantly reducing inference costs through adaptive termination guided by an LLM judge. The study highlights the primacy of agent diversity and quality over mere scale, and demonstrates that both human-designed and LLM-generated agents can substantially expand the effective agent pool. Additional findings show that terminating refinement early with an LLM-based judge preserves performance at roughly half the computational cost, and that scaling up thoughtfully (via TUMIX+) yields further improvements at higher expense. Overall, the work provides practical, reproducible guidance for integrating Code Interpreter and Search tools into multi-agent reasoning and offers open-source resources to advance tool-augmented LLMs.

Abstract

While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
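The refine-and-halt loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Agent` and `Judge` callable types, the prompt wording in `build_joint_prompt`, the `max_rounds` cap, and the majority vote used as the final selection step are all assumptions made for the example. Agents are also called sequentially here for simplicity, whereas the paper runs them in parallel.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the paper): an agent maps a prompt to an
# answer string; a judge returns True when refinement should stop.
Agent = Callable[[str], str]
Judge = Callable[[str, List[str]], bool]


def build_joint_prompt(question: str, previous_answers: List[str]) -> str:
    """Concatenate the original question with the previous round's answers."""
    if not previous_answers:
        return question
    shared = "\n".join(
        f"Agent {i + 1} answered: {a}" for i, a in enumerate(previous_answers)
    )
    return f"{question}\n\nAnswers from the previous round:\n{shared}"


def tumix_sketch(
    question: str, agents: List[Agent], judge: Judge, max_rounds: int = 5
) -> Tuple[str, int]:
    """Run refinement rounds until the judge says stop or max_rounds is hit."""
    answers: List[str] = []
    rounds_run = 0
    for _ in range(max_rounds):
        prompt = build_joint_prompt(question, answers)
        # Every agent sees the same joint prompt and produces a new answer.
        answers = [agent(prompt) for agent in agents]
        rounds_run += 1
        if judge(question, answers):
            break
    # Assumed selection strategy: majority vote over the last round's answers.
    final_answer = Counter(answers).most_common(1)[0][0]
    return final_answer, rounds_run


# Toy stand-ins for real tool-using agents and an LLM judge.
def stubborn(prompt: str) -> str:
    return "42"


def follower(prompt: str) -> str:
    # Adopts "42" only once it appears in the shared answers.
    return "42" if "42" in prompt else "41"


def agreement_judge(question: str, answers: List[str]) -> bool:
    return len(set(answers)) == 1


answer, rounds = tumix_sketch(
    "What is 6 * 7?", [stubborn, follower, follower], agreement_judge
)
print(answer, rounds)
```

In this toy run the two `follower` agents change their answer after seeing the first round's responses, so the judge halts refinement after the second round instead of spending all `max_rounds`.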

Paper Structure

This paper contains 24 sections, 4 equations, 13 figures, 15 tables, 2 algorithms.

Figures (13)

  • Figure 1: Comparison of tool-augmented test-time scaling methods on Gemini-2.5-Pro (first row) and Gemini-2.5-Flash (second row) across HLE, GPQA, and AIME 24&25. Except for methods without test-time scaling (w/o TTS) or additional scaling (TUMIX+), all methods in the same subplot use nearly the same number of inferences and tokens. For fair comparison, methods that originally lacked tool use are run with strong tool-augmented agents instead of text-only agents. Each score is the average of three repeated runs.
  • Figure 2: Overview of the TUMIX framework. At each iteration, the responses from all agents in the previous round are concatenated with the original question, forming a joint prompt for the next round. This prompt is then provided to all agents (either the same or new agent groups) to produce refined answers. Subsequent prompts follow the structure illustrated above. The design of the agents, their number and specialization, the refinement termination criteria, and the selection strategies are the key factors that determine the effectiveness of the framework.
  • Figure 3: Evolution of coverage, individual agent scores, and average scores across refinement rounds. Coverage decreases monotonically across all benchmarks. For HLE and AIME, the average score rises over the initial rounds before plateauing. For GPQA, the average score improves early on but subsequently declines with further refinement.
  • Figure 4: Sankey diagram of the evolution of correctly answering agents across 2,500 HLE questions over refinement rounds. Based on the distribution dynamics, we define five categories: all wrong (0), few correct (1–3), moderate correct (4–11), high correct (12–14), and all correct (15).
  • Figure 5: Coverage and average score vs. rounds under varying agent diversity and quality. In 15_Agents, all 15 varied pre-designed agents generate one answer each round. In 3_Agents, three strong agents (C$^+$, CS$_{\text{gs}}$, and CSG$_{\text{gs}}$) each sample 5 times per round. In 1_Agent_Strong and 1_Agent_Weak, CS$_{\text{gs}}$ and w/o TTS sample 15 times, respectively.
  • ...and 8 more figures
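
The quantities tracked in Figures 3 to 5 can be illustrated with a small sketch. It assumes that coverage means the fraction of questions for which at least one agent is correct, and that the average score is the mean accuracy over all agent answers; these readings, along with the function names and the toy correctness matrix, are assumptions for illustration. The category boundaries in `figure4_category` are the ones listed in the Figure 4 caption for 15 agents.

```python
from typing import List


def coverage(correct: List[List[bool]]) -> float:
    """Fraction of questions with at least one correct agent (assumed definition)."""
    return sum(any(row) for row in correct) / len(correct)


def average_score(correct: List[List[bool]]) -> float:
    """Mean accuracy over every (question, agent) pair (assumed definition)."""
    total = sum(len(row) for row in correct)
    return sum(sum(row) for row in correct) / total


def figure4_category(num_correct: int) -> str:
    """Map the number of correct agents (out of 15) to a Figure 4 category."""
    if not 0 <= num_correct <= 15:
        raise ValueError("num_correct must be between 0 and 15")
    if num_correct == 0:
        return "all wrong"
    if num_correct <= 3:
        return "few correct"
    if num_correct <= 11:
        return "moderate correct"
    if num_correct <= 14:
        return "high correct"
    return "all correct"


# Toy data: rows are questions, columns are agents in one refinement round.
round_correct = [
    [True, False, False],
    [False, False, False],
    [True, True, True],
    [False, True, True],
]
print(coverage(round_correct))       # 0.75
print(average_score(round_correct))  # 0.5
print(figure4_category(12))          # high correct
```

Under these definitions, coverage can fall while the average score rises: as agents converge on a shared answer across rounds, a lone correct agent may be talked out of its answer even as more agents agree on correct answers elsewhere.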