Best of mini-N in-loop Sampling: A Contextual Quality Reward Model for Reliable and Efficient Best-of-N Sampling

Hyung Gyu Rho, Sian Lee

TL;DR

The paper identifies a reliability gap in Best-of-N (BoN) sampling: reward models trained for BoN learn relative preferences but carry no explicit acceptability signal, so on hard prompts the risk of false acceptances grows with the number of samples. It introduces a choice-based reward model that augments pairwise preference data with an outside option, which yields a context-dependent notion of acceptability through a normalized reward and a multinomial logit formulation. Building on this, the authors propose Best of mini-N in-loop, an adaptive inference strategy that partitions the generation budget into sequential loops and uses a calibrated threshold tuned either for reliability (alignment guardrail) or for speed (inference accelerator). On IMDB sentiment, the guardrail configuration reduces reliability failures by 70% with minimal quality loss, while the accelerator configuration speeds up inference by over 22% with high recall. The framework gives practitioners a way to manage the trade-off between reliability and efficiency in BoN-style alignment by explicitly modeling contextual acceptability and setting thresholds from data.
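
To make the multinomial logit idea concrete, the block below writes out one standard discrete-choice form with an outside option. It is a sketch under stated assumptions, not the paper's exact parameterization: the symbols $r_\theta$, $y_0$, $y_1$, $y_2$ are introduced here for illustration, and the outside option's utility is assumed to be normalized to zero.

```latex
% Assumed setup: prompt x, two candidate responses y_1 and y_2,
% and an outside option y_0 ("neither response is acceptable").
% Assumption: the outside option has utility 0, so r_theta is measured relative to it.
\[
P(y_i \mid x) \;=\; \frac{\exp\big(r_\theta(x, y_i)\big)}
                         {1 + \exp\big(r_\theta(x, y_1)\big) + \exp\big(r_\theta(x, y_2)\big)},
\qquad i \in \{1, 2\},
\]
\[
P(y_0 \mid x) \;=\; \frac{1}{1 + \exp\big(r_\theta(x, y_1)\big) + \exp\big(r_\theta(x, y_2)\big)}.
\]
```

Under this assumed normalization, a response with $r_\theta(x, y) > 0$ would be chosen over the outside option more often than not in a two-way choice, which is what lets a fixed threshold on the reward act as an acceptability test rather than only a ranking signal.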

Abstract

Modern preference alignment techniques, such as Best-of-N (BoN) sampling, rely on reward models trained with pairwise comparison data. While effective at learning relative preferences, this paradigm fails to capture a signal of response acceptability, leaving systems vulnerable to selecting the least bad of many unacceptable options. This is particularly problematic for hard prompts, where the risk of such false acceptances increases with the number of samples. In this paper, we address this critical reliability gap by introducing a new data collection and modeling framework. By augmenting preference data with an outside option, inspired by discrete choice models, we train a reward model that can distinguish not just what is better, but what is good enough. We leverage this capability to create an adaptive inference strategy, best of mini-N in-loop, which partitions the generation budget into sequential loops with a calibrated early-exit condition. Our experiments show that, when tuned as an alignment guardrail, it reduces reliability failures by 70%, and when tuned as an inference accelerator, it improves average inference speed by over 22% in the IMDB-sentiment setting. We thus provide a principled and flexible framework for practitioners to explicitly manage the trade-off between reliability and computational efficiency.
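
The looped, early-exit inference strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the `generate` and `reward` callables, the function name, and the return format are hypothetical, and what happens when no candidate clears the threshold (here, returning the best candidate with `accepted=False`) is an assumption rather than something the text above specifies.

```python
from typing import Callable, List, Optional, Tuple


def best_of_mini_n_in_loop(
    prompt: str,
    generate: Callable[[str, int], List[str]],  # hypothetical sampler: (prompt, n) -> n responses
    reward: Callable[[str, str], float],        # hypothetical reward model: (prompt, response) -> score
    mini_n: int,
    num_loops: int,
    threshold: float,
) -> Tuple[Optional[str], float, int, bool]:
    """Split a budget of mini_n * num_loops samples into sequential loops.

    Returns (best_response, best_score, samples_used, accepted).
    """
    best_response: Optional[str] = None
    best_score = float("-inf")
    samples_used = 0

    for _ in range(num_loops):
        # Draw one mini-batch of candidates and score each of them.
        candidates = generate(prompt, mini_n)
        samples_used += len(candidates)
        for candidate in candidates:
            score = reward(prompt, candidate)
            if score > best_score:
                best_response, best_score = candidate, score

        # Early exit: stop as soon as the best candidate so far is "good enough".
        if best_score >= threshold:
            return best_response, best_score, samples_used, True

    # Assumption: if the budget is exhausted, report the best candidate
    # but flag it as not accepted so the caller can decide what to do.
    return best_response, best_score, samples_used, False
```

With `mini_n=16` and `num_loops=2`, the total budget matches a BoN-32 baseline, which corresponds to the "Mini-16 in 2 loops" setting named in the figure captions. A higher threshold would push the procedure toward the guardrail behavior (fewer accepted responses, more samples used), and a lower one toward the accelerator behavior (earlier exits).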

Paper Structure

This paper contains 35 sections, 16 equations, 3 figures, 2 tables, and 1 algorithm.

Figures (3)

  • Figure 1: In standard BoN, the False Positive Count increases with the number of samples ($N$), even as the mean reward improves. This highlights a critical reliability vulnerability.
  • Figure 2: Performance of the alignment guardrail configuration compared to the BoN-32 baseline. The Mini-16 in 2 loops setting dramatically reduces the False Positive Count with only a marginal decrease in mean reward.
  • Figure 3: Performance of the inference accelerator configuration. The Mini-16 in 2 loops setting provides the fastest mean execution time, outperforming the BoN-32 baseline by over 22%.