Best-of-$\infty$ -- Asymptotic Performance of Test-Time LLM Ensembling

Junpei Komiyama; Daisuke Oba; Masafumi Oyamada

TL;DR

This work analyzes majority-voting best-of-$N$ in the limit $N \to \infty$, proposes an adaptive generation scheme that selects $N$ based on answer agreement so that inference-time computation is allocated efficiently, and extends the framework to weighted ensembles of multiple LLMs, showing that such mixtures can outperform any individual model.

Abstract

We study best-of-$N$ for large language models (LLMs) where the selection is based on majority voting. In particular, we analyze the limit $N \to \infty$, which we denote as best-of-$\infty$. While this approach achieves impressive performance in the limit, it requires an infinite test-time budget. To address this, we propose an adaptive generation scheme that selects $N$ based on answer agreement, thereby efficiently allocating inference-time computation. Beyond adaptivity, we extend the framework to weighted ensembles of multiple LLMs, showing that such mixtures can outperform any individual model. The optimal ensemble weighting is formulated and efficiently computed as a mixed-integer linear program. Extensive experiments demonstrate the effectiveness of our approach.
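To make the idea of choosing $N$ from answer agreement concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the stopping rule used in the paper: the vote-margin criterion, the function name `adaptive_majority_vote`, and the parameters `min_samples`, `max_samples`, and `margin` are all hypothetical, and `sample_answer` stands in for one LLM generation reduced to its final answer.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_majority_vote(
    sample_answer: Callable[[], str],
    min_samples: int = 3,
    max_samples: int = 50,
    margin: int = 3,
) -> Tuple[str, int]:
    """Sample answers until the leader is `margin` votes ahead of the runner-up.

    The margin rule is an illustrative assumption, not the paper's criterion.
    Returns the current majority answer and the number of samples drawn.
    """
    counts: Counter = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        if n < min_samples:
            continue  # always draw a few samples before considering a stop
        top = counts.most_common(2)
        runner_up = top[1][1] if len(top) > 1 else 0
        if top[0][1] - runner_up >= margin:
            return top[0][0], n  # answers agree strongly enough: stop early
    # Budget exhausted: fall back to the plain majority vote.
    return counts.most_common(1)[0][0], max_samples
```

With a rule of this kind, questions on which the samples agree stop after a handful of generations, while contested questions consume the full budget, which is the sense in which computation is allocated adaptively.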
