The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation
Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov
TL;DR
The paper addresses a side effect of RLVR: fine-tuning can erode generation diversity, which in turn degrades Best-of-N (BoN) performance at larger N. It derives unbiased gradient estimators that directly maximize the continuous max@k metric in both on-policy and off-policy settings, and demonstrates that aligning training with BoN inference improves max@k on coding benchmarks. Key findings are that continuous rewards are crucial and that off-policy BoN often yields the strongest gains (up to $+3.7$ p.p. on several datasets) while maintaining or improving performance at lower k. The work provides a practical, scalable approach to reconcile RL fine-tuning with Best-of-N inference, enabling more robust problem solving in verifiable-reward settings.
Abstract
The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element in modern RLVR algorithms that allows better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
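To make the max@k metric concrete: given n sampled generations with scalar rewards, max@k is the expected maximum reward among k of them. The sketch below is a minimal Python illustration, assuming the standard combinatorial estimator over k-subsets drawn without replacement from the n samples. The abstract does not spell out the paper's own estimator, so the function name `max_at_k` and this formulation are illustrative assumptions, not the authors' implementation. With binary rewards the estimate reduces to the familiar pass@k estimate.

```python
from math import comb


def max_at_k(rewards, k):
    """Estimate max@k from n sampled rewards (hypothetical helper).

    Averages the maximum reward over all size-k subsets of the n samples,
    computed in closed form rather than by enumerating subsets.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")

    ordered = sorted(rewards)  # ascending: ordered[i - 1] is the i-th smallest
    # The i-th smallest reward is the maximum of exactly C(i-1, k-1) subsets
    # of size k: itself plus k-1 samples chosen from the i-1 ranked below it.
    # math.comb returns 0 when i-1 < k-1, so low ranks contribute nothing.
    weighted = sum(
        comb(i - 1, k - 1) * r for i, r in enumerate(ordered, start=1)
    )
    return weighted / comb(n, k)
```

For example, `max_at_k([0, 0, 1, 1], 2)` gives 5/6, matching the pass@k value 1 - C(2,2)/C(4,2) for two correct samples out of four. Setting k = 1 recovers the mean reward, and k = n recovers the maximum observed reward.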
