Best-of-$\infty$ -- Asymptotic Performance of Test-Time LLM Ensembling

Junpei Komiyama, Daisuke Oba, Masafumi Oyamada

TL;DR

This work proposes an adaptive generation scheme that selects $N$ based on answer agreement, thereby allocating inference-time computation efficiently while approaching the performance of the $N \to \infty$ limit. It also extends the framework to weighted ensembles of multiple LLMs, showing that such mixtures can outperform any individual model.

Abstract

We study best-of-$N$ for large language models (LLMs) where the selection is based on majority voting. In particular, we analyze the limit $N \to \infty$, which we denote as best-of-$\infty$. While this approach achieves impressive performance in the limit, it requires an infinite test-time budget. To address this, we propose an adaptive generation scheme that selects $N$ based on answer agreement, thereby efficiently allocating inference-time computation. Beyond adaptivity, we extend the framework to weighted ensembles of multiple LLMs, showing that such mixtures can outperform any individual model. The optimal ensemble weighting is formulated and efficiently computed as a mixed-integer linear program. Extensive experiments demonstrate the effectiveness of our approach.
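
The idea of choosing $N$ from answer agreement can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: the stopping rule here is a simple lead-margin check (stop once the leading answer is `margin` votes ahead of the runner-up), swapped in for whatever confidence criterion the paper uses. The names `majority_vote`, `adaptive_majority`, `stub_llm`, and all parameters are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def majority_vote(answers: List[Hashable]) -> Hashable:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def adaptive_majority(
    sample: Callable[[], Hashable],
    min_samples: int = 3,
    margin: int = 3,
    max_samples: int = 100,
) -> Tuple[Hashable, int]:
    """Draw answers until the leader is `margin` votes ahead of the runner-up.

    ASSUMPTION: the lead-margin rule is an illustrative stand-in for the
    paper's stopping criterion. Returns (chosen answer, samples used).
    """
    counts: Counter = Counter()
    n = 0
    while n < max_samples:
        counts[sample()] += 1
        n += 1
        top = counts.most_common(2)
        runner_up = top[1][1] if len(top) > 1 else 0
        if n >= min_samples and top[0][1] - runner_up >= margin:
            break
    return counts.most_common(1)[0][0], n


# Hypothetical stand-in for an LLM: answers are drawn from a fixed
# distribution whose mode ("42") plays the role of the best-of-infinity answer.
rng = random.Random(0)


def stub_llm() -> str:
    return rng.choices(["42", "41", "7"], weights=[0.6, 0.3, 0.1])[0]


answer, used = adaptive_majority(stub_llm)
print(answer, used)
```

With a sampler that always agrees, the loop stops after `min_samples` draws; when answers are split, it keeps sampling up to `max_samples`, so compute is spent mainly on the problems where the majority is unclear.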

Paper Structure

This paper contains 43 sections, 7 theorems, 35 equations, 17 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

(Consistency) Assume that the LLM generates a finite number of answers $1,2,\dots,s$. For ease of discussion, let $p_j$ be the probability of answer $j$ and assume that $p_1 > p_2 \ge p_3 \ge \ldots \ge p_s > 0$. Namely, there are no ties for the most frequent answer, and each answer is generated wi
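
To make the assumption $p_1 > p_2$ concrete, the following worked calculation uses hypothetical values $s = 2$, $p_1 = 0.6$, $p_2 = 0.4$. It illustrates only the setting (plain majority voting over i.i.d. generations), not the theorem's conclusion, which is truncated above.

```latex
\begin{align*}
\Pr[\text{majority of } N = 3 \text{ picks answer } 1]
  &= p_1^3 + 3 p_1^2 p_2 = 0.216 + 0.432 = 0.648, \\
\Pr[\text{majority of } N = 5 \text{ picks answer } 1]
  &= \sum_{k=3}^{5} \binom{5}{k} p_1^k p_2^{5-k}
   = 0.3456 + 0.2592 + 0.07776 = 0.68256 .
\end{align*}
```

Because $p_1 > p_2$, the law of large numbers implies that the empirical frequency of answer 1 eventually exceeds that of every other answer, so this probability tends to 1 as $N \to \infty$.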

Figures (17)

  • Figure 1: Accuracy of Best-of-$N$ with majority voting as a function of $N$ (GPT-OSS-20B (Medium)) with four datasets [jia2024aime, opencompass2025aime, rein2023gpqagraduatelevelgoogleproofqa, hendrycksmath2021]. The green line indicates the asymptotic accuracy as $N \rightarrow \infty$. For each problem, BoN benefits from increasing $N$, at least from $N = 10^1$ to $10^2$.
  • Figure 2: An illustration of adaptive sampling (Algorithm 1). The histogram shows the distribution of answers generated by an LLM for a single problem. Each answer generation can be viewed as a sample from the underlying distribution. Blue indicates the most frequent answer, and orange indicates the others. In the top example, three generations agree, so sampling stops. In the bottom example, more samples are needed to determine the majority. This maximizes the accuracy under a given compute budget. Confidence in the majority is based on the Bayes factor.
  • Figure 3: Visualization of the non-concave objective function $f(w)$ over the weight simplex. The yellow simplex is the set of weight vectors $w$ over the three LLMs. Each of the five gray polytopes (one per problem) is the region of weights for which the weighted majority answers the corresponding problem correctly. The optimal solution lies in the intersection of four polytopes at the center, which corresponds to the case where four out of five problems are answered correctly.
  • Figure 4: Cost analysis of our proposed method and fixed BoN, for GPT-OSS-20B on MATH500. "Adaptive" (Algorithm 1) with an average sample size of $\bar{N} = 3$ achieves the same accuracy as a "fixed" sample size of $N=10$, and the algorithm with average sample size $\bar{N} \approx 10$ achieves the same accuracy as fixed $N=100$. Thus, adaptive sampling in this plot reduces computation by a factor of 2x-5x. Both approach the best-of-$\infty$ performance (green dashed line).
  • Figure 5: Performance comparison of the LLM ensemble of EXAONE-Deep-32B, MetaStone-S1-32B, Phi-4-reasoning, Qwen3-30B-A3B-Thinking, and GPT-OSS-20B on GPQA-Diamond. The weight is optimized to $w = (0.0176, 0.0346, 0.2690, 0.4145, 0.2644)$. The LLM ensemble outperforms any single LLM with $N \ge 5$ and approaches the blue dashed line of best-of-$\infty$ performance.
  • ...and 12 more figures
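
The objective $f(w)$ described for Figure 3 (the number of problems the weighted majority answers correctly) can be sketched as follows. This is a minimal illustration under assumptions: each LLM's per-problem answer distribution is taken as known, the ensemble answer in the large-$N$ limit is taken to be the mode of the $w$-weighted mixture, and a coarse grid search stands in for the mixed-integer linear program mentioned in the abstract. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Dist = Dict[str, float]               # answer -> probability for one LLM
Problem = Tuple[List[Dist], str]      # (one Dist per LLM, gold answer)


def ensemble_answer(weights: Sequence[float], dists: List[Dist]) -> str:
    """Mode of the weighted mixture of the LLMs' answer distributions."""
    mix: Dict[str, float] = {}
    for w, dist in zip(weights, dists):
        for ans, p in dist.items():
            mix[ans] = mix.get(ans, 0.0) + w * p
    return max(mix, key=mix.get)


def objective(weights: Sequence[float], problems: List[Problem]) -> int:
    """f(w): number of problems whose ensemble answer equals the gold answer."""
    return sum(ensemble_answer(weights, d) == gold for d, gold in problems)


# Toy data (made up for illustration): two LLMs, three problems.
problems: List[Problem] = [
    ([{"x": 0.9, "y": 0.1}, {"x": 0.4, "y": 0.6}], "x"),  # only LLM 0 is right
    ([{"x": 0.4, "y": 0.6}, {"x": 0.9, "y": 0.1}], "x"),  # only LLM 1 is right
    ([{"x": 0.7, "y": 0.3}, {"x": 0.8, "y": 0.2}], "x"),  # both are right
]

# Coarse grid search over the 1-D simplex (a stand-in for the MILP).
grid = [(i / 10, 1 - i / 10) for i in range(11)]
best = max(grid, key=lambda w: objective(w, problems))

print(objective((1.0, 0.0), problems))   # LLM 0 alone
print(objective((0.0, 1.0), problems))   # LLM 1 alone
print(best, objective(best, problems))   # best mixture found on the grid
```

In this toy case each single model answers two of three problems correctly, while an even mixture answers all three, which is the kind of gain the weighted ensemble is meant to capture. For a fixed problem, the set of $w$ yielding a correct answer is defined by linear inequalities in $w$, which matches the polytope picture in Figure 3.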

Theorems & Definitions (15)

  • Theorem 1
  • Example 1: AIME2025
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof: Proof of Theorem 1
  • ...and 5 more