Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Hangfan Zhang, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, Shuyue Hu

TL;DR

ModelSwitch addresses the cost-inefficiency of test-time sampling in LLMs by using multiple, diverse LLMs and a consistency-driven switching signal. Building on the repeated-sampling-then-voting paradigm, it introduces a weighted voting mechanism and a model-switching policy to exploit complementary strengths across models, achieving higher accuracy with fewer samples. Theoretical results give conditions under which ModelSwitch strictly outperforms single-model approaches and provide a bound on efficiency gains, while experiments across seven benchmarks demonstrate state-of-the-art performance and substantial inference-cost reductions. This approach offers a practical, scalable path to more efficient and reliable reasoning with ensembles of LLMs, with potential to integrate stronger verification methods for further gains.

Abstract

This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths that potentially arise from diverse training data and paradigms. By using consistency as a signal, our strategy dynamically switches between models. Theoretical analysis highlights the efficiency and performance advantages of our strategy. Extensive experiments on six datasets demonstrate that our strategy not only outperforms self-consistency and state-of-the-art multi-agent debate approaches, but also significantly reduces inference costs. Additionally, ModelSwitch requires only a few comparable LLMs to achieve optimal performance and can be extended with verification methods, demonstrating the potential of leveraging multiple LLMs in the generation-verification paradigm.
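The consistency signal mentioned above can be made concrete with a small sketch. It assumes, following the Figure 2 caption, that consistency is measured by the entropy of the sampled answers; the function names `answer_entropy` and `majority_vote` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution of sampled answers.

    Lower entropy means the samples agree more, i.e. higher consistency.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    total = len(answers)
    # p * log(1/p) summed over distinct answers; zero when all samples agree.
    return sum((c / total) * math.log(total / c) for c in counts.values())


def majority_vote(answers):
    """Most frequent answer; ties resolve to the answer that appeared first."""
    return Counter(answers).most_common(1)[0][0]
```

For example, three identical answers give an entropy of 0, while an even split between two answers gives `log(2)`. A low value suggests the model's vote can be trusted; a high value suggests it may be worth consulting another model.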

Paper Structure

This paper contains 41 sections, 3 theorems, 19 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Proposition 5.1

The necessary condition for ModelSwitch to obtain the correct answer is $P>\frac{2}{m}$. The sufficient condition is

Figures (9)

  • Figure 1: Performance comparison of ModelSwitch and self-consistency (Wang et al., 2022) on the MATH (Lightman et al., 2023) and MathBench (Liu et al., 2024) datasets. ModelSwitch switches between Gemini 1.5 Flash and GPT-4o mini on MATH, and between Gemma-2-9B-It and Llama-3.1-8B-Instruct on MathBench. The curves illustrate the performance of individual LLMs under self-consistency. For comparison, horizontal lines mark the single-sample performance of larger LLMs, including GPT-4o, Gemini 1.5 Pro, and Llama-3.1-70B-Instruct, as baselines. On MATH, ModelSwitch achieves 81% accuracy with only 35 samples, outperforming Gemini 1.5 Flash (79.8% accuracy with 512 samples) while being 14$\times$ more efficient. On MathBench, similar results are observed with open-source models: ModelSwitch achieves 75% accuracy (48 samples), outperforming Gemma-2-9B-It (73.7%, 512 samples) with 10$\times$ efficiency. Additionally, combining 9B and 8B models achieves performance (69%) comparable to a 70B model (68.7%) with only 7 samples.
  • Figure 2: Correlation between consistency (entropy) and accuracy of answers from six LLMs on MATH and MathBench. We use the transparency and size of the scatter points to indicate the number of queries corresponding to each point (the larger and more opaque the point, the greater the quantity). We report the correlation coefficient $r$ and the significance indicator $p$. In each subplot, entropy and accuracy exhibit a moderate ($0.5 < |r| \leq 0.8$) or high ($|r| > 0.8$) correlation, and this correlation is statistically significant ($p < 0.001$).
  • Figure 3: An overview of how the ModelSwitch approach works between two LLMs. Given a sample budget $K$, it first queries the first LLM with $\frac{K}{2}$ samples. If self-consistency is achieved, the answer is accepted, saving 50% of the budget. Otherwise, it queries the second LLM with the remaining $\frac{K}{2}$ samples and aggregates answers from both models, potentially improving performance if the second LLM excels at the task.
  • Figure 4: Performance comparison of self-consistency for each LLM (GPT-4o mini and Gemini 1.5 Flash) and ModelSwitch using both. We use horizontal lines to mark the single-sample results of more advanced LLMs GPT-4o and Gemini 1.5 Pro. We use the shade and size of red stars to differentiate the sampling budgets of ModelSwitch. The horizontal coordinate of the red star reflects the actual sampling counts of ModelSwitch.
  • Figure 5: Performance comparison of different multi-agent debate systems under the same budget of 15 samples, relative to the accuracy of the best single LLM with one sample. ModelSwitch achieves the best results on four datasets and the second-best results on MATH and DATE.
  • ...and 4 more figures
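
The two-model procedure described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: "self-consistency is achieved" is taken to mean that all first-model samples agree, the final aggregation is a plain majority vote rather than the weighted voting the paper introduces, and the `Sampler` callables are hypothetical stand-ins for real LLM calls.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: (question, n) -> list of n final answers from one LLM.
Sampler = Callable[[str, int], List[str]]


def model_switch(question: str, first: Sampler, second: Sampler,
                 budget: int) -> Tuple[str, int]:
    """Return (answer, samples_used) for a two-model switching policy."""
    if budget < 2:
        raise ValueError("budget must allow at least one sample per model")
    half = budget // 2
    answers = first(question, half)
    # Assumption: consistency means the first model's samples are unanimous.
    if len(set(answers)) == 1:
        return answers[0], half  # accept early, about half the budget is saved
    # Otherwise spend the remaining samples on the second model.
    answers = answers + second(question, budget - half)
    # Assumption: unweighted majority vote over the pooled answers.
    winner = Counter(answers).most_common(1)[0][0]
    return winner, budget
```

With a first model that always answers "42", a budget of 8 returns after 4 samples. If the first model's samples disagree, all 8 samples are used and the pooled vote decides the answer.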

Theorems & Definitions (8)

  • Proposition 5.1
  • proof
  • Example 1
  • Example 2
  • Theorem C.1
  • Definition C.2: Empirical distribution, type class
  • Lemma C.3: The Method of Types
  • proof