Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

Hsiang Hsu, Eric Lei, Chun-Fu Chen

TL;DR

The paper shows theoretically that light-tailed regimes favor optimism to unearth high-quality outliers, whereas heavy-tailed regimes require pessimism to guard against reward mis-calibration in the extremes. It introduces Best-of-Tails, an adaptive inference-time alignment framework that uses Tsallis divergence as a tunable regularizer to interpolate more finely between these extremes.

Abstract

Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, current strategies face a fundamental dilemma: "optimistic" approaches like Best-of-$N$ suffer from reward hacking, while "pessimistic" regularized methods often stifle the exploration needed to discover high-quality responses. In this work, we formalize this trade-off through the lens of regret minimization, demonstrating that the optimal strategy depends critically on the tail behavior of the reward distribution. We show theoretically that light-tailed regimes favor optimism to unearth high-quality outliers, whereas heavy-tailed regimes require pessimism to guard against reward mis-calibration in the extremes. Guided by this insight, we introduce Best-of-Tails (BoT), an adaptive inference-time alignment framework that uses Tsallis divergence as a tunable regularizer to provide a finer granularity of interpolation between these extremes. BoT uses the Hill estimator to characterize reward-tail heaviness on a per-prompt basis and dynamically adjusts its selection rule to balance exploration gains against alignment error. Across math, multiple-choice reasoning, and human-preference evaluations, BoT improves alignment performance across a range of reference and reward model configurations relative to fixed-strategy baselines.
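The two building blocks named in the abstract, Best-of-$N$ selection and the Hill estimator, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `best_of_n` and `hill_estimate`, the choice of `k`, and the requirement that the largest rewards be strictly positive are all assumptions made here. The textbook Hill estimator is used, which averages log-ratios of the `k` largest values against the next order statistic.

```python
import math


def best_of_n(candidates, proxy_rewards):
    """Optimistic selection: return the candidate with the highest proxy reward."""
    if len(candidates) != len(proxy_rewards) or not candidates:
        raise ValueError("need one proxy reward per candidate")
    best = max(range(len(candidates)), key=lambda i: proxy_rewards[i])
    return candidates[best]


def hill_estimate(values, k):
    """Textbook Hill estimator of tail heaviness from the k largest values.

    Assumption (not from the paper): the top k + 1 values are strictly
    positive, e.g. rewards have been shifted beforehand.
    """
    if k < 1 or k >= len(values):
        raise ValueError("k must satisfy 1 <= k < len(values)")
    ordered = sorted(values, reverse=True)
    threshold = ordered[k]  # the (k + 1)-th largest value
    if threshold <= 0:
        raise ValueError("the top k + 1 values must be strictly positive")
    return sum(math.log(v / threshold) for v in ordered[:k]) / k
```

Under these assumptions, a larger return value from `hill_estimate` indicates a heavier upper tail among the sampled rewards for a prompt, which is the regime in which the abstract argues for more pessimistic selection than plain `best_of_n`.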

Paper Structure

This paper contains 38 sections, 6 theorems, 62 equations, 11 figures, 1 algorithm.

Key Result

Proposition 1

For a given prompt $x\in{\mathcal{X}}$, the inference-time regret of a general inference-time alignment policy $\hat{\pi}_w(y|x)$, defined via a re-weighting function $w(y|x)$, admits the following upper bound (a more general upper bound is provided in the appendix). Here, $D_\textsf{TV}(\cdot\|\cdot)$ is the total variation distance (lehmann2005testing). $\Delta(\pi_w(\cdot|$...

Figures (11)

  • Figure 1: Conceptual illustration of selection probabilities (colored solid lines) for optimistic, pessimistic, and the proposed BoT strategies. The plots depict how these strategies re-weight candidates under light-tailed (left) versus heavy-tailed (right) reward distributions (black dashed lines). While optimism consistently concentrates mass on the highest rewards (risking reward hacking) and pessimism remains conservative (risking under-exploration), BoT adaptively shifts its strategy: it mimics optimism in light-tailed regimes to exploit safe gains, but pivots toward robust, conservative selection in heavy-tailed regimes to prevent over-optimization.
  • Figure 2: Optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on GSM8K, MMLU, and MATH (rows) across different reference models and proxy rewards (columns). Each curve traces the trajectory of True Reward ($r^*$) vs. Proxy Reward ($\hat{r}$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$. Optimistic strategies typically over-optimize the proxy reward as $N$ grows, leading to reward hacking (where $r^*$ degrades despite increasing $\hat{r}$). Conversely, ITP tends to saturate early, failing to leverage larger $N$ for further gains. The proposed BoT successfully navigates this trade-off, achieving higher true and proxy rewards without succumbing to reward hacking.
  • Figure 3: Left: Performance comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on AlpacaFarm, using Llama-RM as the proxy and Alpaca-RM as the true reward. Right: The prompt-specific Hill estimates $\hat{\kappa}(x)$ and the corresponding adaptive parameter $\alpha(x)$. By design, BoT minimizes inference regret by shifting $\alpha \to 2$ under heavy tails (large $\hat{\kappa}$) to enforce robustness, while allowing $\alpha \to 1$ under light tails (small $\hat{\kappa}$) to enable aggressive exploration.
  • Figure A.4: Comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on GSM8K across different reference models and proxy rewards. Each curve traces the trajectory of True Reward ($r^*$) vs. Proxy Reward ($\hat{r}$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$.
  • Figure A.5: Comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on MMLU across different reference models and proxy rewards. Each curve traces the trajectory of True Reward ($r^*$) vs. Proxy Reward ($\hat{r}$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$.
  • ...and 6 more figures
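
The Figure 3 caption describes the regularization order $\alpha$ moving between 1 and 2. The sketch below computes a Tsallis divergence between two discrete distributions to illustrate what those endpoints mean. It assumes the common normalization $D_\alpha(P\|Q) = \frac{1}{\alpha-1}\left(\sum_i p_i^\alpha q_i^{1-\alpha} - 1\right)$, which may differ from the paper's definition by a constant factor; the function name `tsallis_divergence` is hypothetical. Under this normalization the divergence reduces to the KL divergence as $\alpha \to 1$ and equals the $\chi^2$ divergence at $\alpha = 2$.

```python
import math


def tsallis_divergence(p, q, alpha):
    """Tsallis divergence of order alpha between discrete distributions p and q.

    Assumed normalization (may differ from the paper's):
        D_alpha(P || Q) = (sum_i p_i**alpha * q_i**(1 - alpha) - 1) / (alpha - 1)
    Every q_i paired with a positive p_i must itself be positive.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    pairs = [(pi, qi) for pi, qi in zip(p, q) if pi > 0]
    if any(qi <= 0 for _, qi in pairs):
        raise ValueError("q must be positive wherever p is positive")
    if abs(alpha - 1.0) < 1e-12:
        # Limit alpha -> 1: the KL divergence.
        return sum(pi * math.log(pi / qi) for pi, qi in pairs)
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in pairs)
    return (total - 1.0) / (alpha - 1.0)


p = [0.5, 0.5]
q = [0.25, 0.75]
kl_value = tsallis_divergence(p, q, 1.0)    # KL endpoint
chi2_value = tsallis_divergence(p, q, 2.0)  # chi-squared endpoint
```

In this example `chi2_value` matches $\sum_i (p_i - q_i)^2 / q_i$ computed directly, and orders strictly between 1 and 2 give intermediate penalties.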

Theorems & Definitions (7)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Lemma A.1
  • proof
  • Lemma A.2: Hölder's inequality (ledoux2001concentration)