RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs

Jonathan Geuter, Gregor Kornhardt

TL;DR

RoBoN tackles test-time scaling by spending a fixed budget of $n$ generations across multiple LLMs sequentially, one generation at a time, using a scoring rule that blends a plug-in reward with an agreement signal. The approach is training-free, preserves compute parity, and works with any reward model, so that cross-LLM diversity can surpass single-model best-of-$n$ (BoN) baselines on five reasoning benchmarks. Empirically, RoBoN improves over each individual model for larger budgets (up to 3.4 percentage points in absolute accuracy) and outperforms a uniform multi-model portfolio at the same generation budget. Limitations include runtime overhead relative to parallel BoN, since generations are routed sequentially, and reliance on exact-match agreement; this suggests future work on embedding-based agreement and semi-parallel variants.

Abstract

Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, traditionally best-of-$n$ relies on a single model to generate responses. We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$. Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN sequentially routes generations one-by-one across models, based on scores computed using a reward model and an agreement signal on the predicted responses. This online routing requires no additional training, keeps compute parity, and works with any plug-in reward model. Across reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model for larger $n$, with gains of up to 3.4% in absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited at inference to improve best-of-$n$ performance over any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.
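The routing loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the text here does not give the scoring rule, so the linear blend `alpha * mean_reward + (1 - alpha) * agreement`, the one-sample warm-up per model, the exact-match agreement estimate, and all names (`routed_best_of_n`, `Generator`, `RewardFn`) are hypothetical. Rewards are assumed to lie in $[0, 1]$ so that they are comparable with the agreement term.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Generator = Callable[[str], str]        # prompt -> response (hypothetical interface)
RewardFn = Callable[[str, str], float]  # (prompt, response) -> reward, assumed in [0, 1]


def routed_best_of_n(
    prompt: str,
    models: Dict[str, Generator],
    reward_fn: RewardFn,
    n: int,
    alpha: float = 0.5,
) -> Tuple[str, List[str]]:
    """Spend exactly n generations across models, one at a time.

    ASSUMPTION: the per-model score is a linear blend of mean reward and
    exact-match agreement; the paper's actual rule may differ.
    """
    if n < len(models):
        raise ValueError("budget n must cover one warm-up sample per model")

    samples: List[Tuple[str, str, float]] = []  # (model name, response, reward)

    def draw(name: str) -> None:
        response = models[name](prompt)
        samples.append((name, response, reward_fn(prompt, response)))

    # Warm-up: one generation per model so every model has a score.
    for name in models:
        draw(name)

    # Online routing: each remaining generation goes to the top-scoring model.
    while len(samples) < n:
        counts = Counter(resp for _, resp, _ in samples)

        def score(name: str) -> float:
            own = [(resp, r) for m, resp, r in samples if m == name]
            mean_reward = sum(r for _, r in own) / len(own)
            # Fraction of all samples that exactly match this model's answers.
            agreement = sum(counts[resp] for resp, _ in own) / (len(own) * len(samples))
            return alpha * mean_reward + (1 - alpha) * agreement

        draw(max(models, key=score))

    # Final selection is plain best-of-n over the pooled samples.
    best = max(samples, key=lambda s: s[2])
    return best[1], [m for m, _, _ in samples]
```

Because the loop stops after exactly `n` calls to the generators, the sketch uses the same number of generations as single-model best-of-$n$; the cost is that generations are issued sequentially rather than in parallel.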

Paper Structure

This paper contains 16 sections, 6 equations, 3 figures, 6 tables, 2 algorithms.

Figures (3)

  • Figure 1: RoBoN significantly outperforms best-of-$n$ with individual models for large $n$. Average accuracies across datasets and methods, with 1-sigma confidence intervals. The degrading performance on some datasets as $n$ increases is likely due to reward hacking (Skalse et al., 2025).
  • Figure 2: Average accuracy (averaged over MATH500, OlympiadBench, and MinervaMath) for RoBoN with different values of $\alpha$.
  • Figure 3: Average share of models selected across different values of $n$ in RoBoN, averaged over datasets. RoBoN selects deepseek-coder-6.7b in the majority of cases; however, RoBoN significantly outperforms this model in terms of accuracy, cf. Figure 1.