Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

Jianhao Chen; Zishuo Xun; Bocheng Zhou; Han Qi; Hangfan Zhang; Qiaosheng Zhang; Yang Chen; Wei Hu; Yuzhong Qu; Wanli Ouyang; Shuyue Hu

TL;DR

ModelSwitch addresses the cost-inefficiency of test-time sampling in LLMs by using multiple, diverse LLMs and a consistency-driven switching signal. Building on the repeated-sampling-then-voting paradigm, it introduces a weighted voting mechanism and a model-switching policy that exploit complementary strengths across models, achieving higher accuracy with fewer samples. Theoretical results give conditions under which ModelSwitch strictly outperforms single-model approaches and provide a bound on its efficiency gains, while experiments across seven benchmarks demonstrate state-of-the-art performance and substantial reductions in inference cost. The approach offers a practical, scalable path to more efficient and reliable reasoning with ensembles of LLMs, and it can be combined with stronger verification methods for further gains.

Abstract

This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy, ModelSwitch, builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths that potentially arise from diverse training data and paradigms. By using consistency as a signal, our strategy dynamically switches between models. Theoretical analysis highlights the efficiency and performance advantages of our strategy. Extensive experiments on six datasets demonstrate that our strategy not only outperforms self-consistency and state-of-the-art multi-agent debate approaches, but also significantly reduces inference costs. Additionally, ModelSwitch requires only a few comparable LLMs to achieve optimal performance and can be extended with verification methods, demonstrating the potential of leveraging multiple LLMs in the generation-verification paradigm.
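
The consistency-driven switching described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`consistency`, `model_switch`), the parameters (`samples_per_model`, `threshold`), the entropy-based consistency score, and the tie-breaking rule are all assumptions made for illustration. Each "model" is assumed to be a plain callable that returns one sampled answer per call.

```python
import math
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence, Tuple

# Assumption: a "model" is any callable mapping a question to one sampled answer.
Sampler = Callable[[str], Hashable]


def consistency(answers: Sequence[Hashable]) -> float:
    """Score agreement among answers: 1.0 if all agree, near 0.0 if all differ.

    Assumption: the score is 1 minus the entropy of the empirical answer
    distribution, normalized by log(number of samples).
    """
    n = len(answers)
    if n <= 1:
        return 1.0
    counts = Counter(answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return max(0.0, 1.0 - entropy / math.log(n))


def model_switch(
    question: str,
    models: Sequence[Sampler],
    samples_per_model: int = 3,
    threshold: float = 1.0,
) -> Tuple[Hashable, int]:
    """Query models in order; stop early once one model's samples are consistent.

    Returns the winning answer and the total number of samples drawn.
    """
    weighted: Dict[Hashable, float] = {}
    raw: Counter = Counter()
    used = 0
    for sample in models:
        answers = [sample(question) for _ in range(samples_per_model)]
        used += len(answers)
        weight = consistency(answers)
        # Each sample votes with its model's consistency as the weight.
        for a in answers:
            weighted[a] = weighted.get(a, 0.0) + weight
            raw[a] += 1
        if weight >= threshold:
            break  # consistent enough: do not switch to the next model
    # Highest weighted vote wins; raw counts break ties (e.g. all weights zero).
    best = max(weighted, key=lambda a: (weighted[a], raw[a]))
    return best, used
```

In this sketch, a model whose samples all agree ends the procedure after `samples_per_model` calls, which is where the sample savings would come from; only disagreement triggers a switch to the next model, and the final answer pools the votes of every model consulted.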
