The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov

TL;DR

The paper addresses the problem that RLVR fine-tuning can reduce generation diversity and, with it, the effectiveness of Best-of-N (BoN) sampling at large N. It derives unbiased gradient estimators that directly maximize the continuous max@k metric in both on-policy and off-policy settings, and shows that aligning training with BoN inference improves max@k on coding benchmarks. The key findings are that continuous rewards are crucial and that off-policy BoN often yields the strongest gains (up to +3.7 percentage points on several datasets) while maintaining or improving performance at lower k. The result is a practical way to reconcile RL fine-tuning with Best-of-N inference in verifiable-reward settings.

Abstract

The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivation to off-policy updates, a common element of modern RLVR algorithms that allows better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
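
To make the metric concrete: max@k can be read as the expected best reward among k sampled generations, and it reduces to pass@k when rewards are binary. The sketch below estimates it from n ≥ k sampled rewards by averaging the maximum over all size-k subsets, computed in closed form from the sorted rewards. This is a generic order-statistics estimator, not necessarily the exact form used in the paper; the function names `max_at_k` and `pass_at_k` are illustrative assumptions.

```python
from math import comb


def max_at_k(rewards, k):
    """Estimate the expected maximum reward among k samples from n >= k rewards.

    Assumption: this is a generic subset-averaging estimator, shown only to
    illustrate the metric; it is not taken from the paper.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # The i-th smallest reward (1-indexed) is the maximum of exactly
    # C(i-1, k-1) of the C(n, k) size-k subsets; comb returns 0 when i < k.
    weighted = sum(comb(i - 1, k - 1) * r for i, r in enumerate(ordered, start=1))
    return weighted / comb(n, k)


def pass_at_k(n, c, k):
    """Standard pass@k estimate given c correct generations out of n."""
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    continuous = [0.2, 0.9, 0.5]
    for k in (1, 2, 3):
        print(k, max_at_k(continuous, k))

    # With binary rewards the two estimators coincide.
    binary = [1, 0, 0, 1]
    print(max_at_k(binary, 2), pass_at_k(len(binary), sum(binary), 2))
```

With k = 1 the estimate is the mean reward, and with k = n it is the best observed reward. The continuous version still distinguishes partially correct solutions (for example, a fraction of tests passed) that a binary pass@k would score as zero.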

Paper Structure

This paper contains 32 sections, 35 equations, 2 figures, 14 tables.

Figures (2)

  • Figure 1: Qwen2.5-Coder-7B-Instruct performance on CodeContests dataset before and after GRPO fine-tuning.
  • Figure 2: Distribution of the entropy before and after GRPO fine-tuning.