RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs
Jonathan Geuter, Gregor Kornhardt
TL;DR
RoBoN approaches test-time scaling by routing a fixed budget of generations across multiple LLMs sequentially, one generation at a time, using scores that combine a plug-in reward model with an agreement signal over the predicted responses. The approach is training-free, preserves compute parity with single-model best-of-n (BoN), and works with any reward model, so diversity across LLMs can be exploited to surpass single-model BoN baselines on five reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU). Empirically, RoBoN outperforms BoN applied to each individual model for larger budgets, with gains of up to 3.4 percentage points in absolute accuracy, and it also improves over a uniform multi-model portfolio baseline at the same generation budget. Limitations include runtime overhead relative to parallel BoN and reliance on exact-match agreement, which suggests future work on embedding-based agreement and semi-parallel variants.
Abstract
Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, best-of-$n$ traditionally relies on a single model to generate responses. We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$. Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN routes generations across models one by one, based on scores computed from a reward model and an agreement signal on the predicted responses. This online routing requires no additional training, keeps compute parity, and works with any plug-in reward model. Across reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model for larger $n$, with gains of up to 3.4\% in absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited at inference to improve best-of-$n$ performance over any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.
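To make the routing idea concrete, the following is a minimal Python sketch of a sequential, routed best-of-$n$ loop. It is an illustration under stated assumptions, not the authors' scoring rule: the abstract does not specify how the reward and agreement signals are combined, so the convex combination with weight `alpha`, the use of each model's best reward so far, the one-sample-per-model warm-up, the exact-match agreement fraction, and the assumption that rewards lie in $[0, 1]$ are all choices made here for illustration. The names `routed_best_of_n`, `models`, `reward`, and `alpha` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Generator = Callable[[str], str]      # prompt -> response (one LLM sample)
Reward = Callable[[str, str], float]  # (prompt, response) -> score, assumed in [0, 1]


def routed_best_of_n(
    prompt: str,
    models: Sequence[Generator],
    reward: Reward,
    n: int,
    alpha: float = 0.5,  # assumed weight on the agreement signal
) -> Tuple[str, List[int]]:
    """Spend exactly n generations, one at a time, across several models."""
    if n < len(models):
        raise ValueError("budget n must cover one warm-up sample per model")

    pool: List[Tuple[int, str, float]] = []  # (model index, response, reward)

    def sample(i: int) -> None:
        response = models[i](prompt)
        pool.append((i, response, reward(prompt, response)))

    # Warm-up (an assumption of this sketch): one generation per model.
    for i in range(len(models)):
        sample(i)

    # Online routing: give the next generation to the highest-scoring model.
    while len(pool) < n:
        counts = Counter(response for _, response, _ in pool)

        def model_score(i: int) -> float:
            own = [(resp, r) for j, resp, r in pool if j == i]
            best_reward = max(r for _, r in own)
            # Exact-match agreement: average share of the pool that equals
            # each of this model's responses.
            agreement = sum(counts[resp] for resp, _ in own) / (len(own) * len(pool))
            return (1 - alpha) * best_reward + alpha * agreement

        sample(max(range(len(models)), key=model_score))

    # Final best-of-n selection by reward over the whole pool.
    best = max(pool, key=lambda item: item[2])
    return best[1], [i for i, _, _ in pool]


# Toy usage with deterministic stand-ins for LLMs and a reward model.
toy_models = [lambda p: "4", lambda p: "5"]
toy_reward = lambda p, r: 1.0 if r == "4" else 0.0
answer, routes = routed_best_of_n("What is 2 + 2?", toy_models, toy_reward, n=6)
print(answer, routes)  # prints: 4 [0, 1, 0, 0, 0, 0]
```

In this toy run, the budget of six generations is spent exactly (compute parity with single-model best-of-$n$), and after the warm-up every remaining generation is routed to the model whose responses score higher. A real setup would replace the toy callables with stochastic LLM sampling and a learned reward model.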
