Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

Hsiang Hsu; Eric Lei; Chun-Fu Chen

TL;DR

The paper shows theoretically that light-tailed regimes favor optimism to unearth high-quality outliers, whereas heavy-tailed regimes require pessimism to guard against reward mis-calibration in the extremes. It introduces Best-of-Tails (BoT), an adaptive inference-time alignment framework that uses Tsallis divergence as a tunable regularizer to interpolate more finely between these extremes.

Abstract

Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, current strategies face a fundamental dilemma: ``optimistic'' approaches like Best-of-$N$ suffer from reward hacking, while ``pessimistic'' regularized methods often stifle the exploration needed to discover high-quality responses. In this work, we formalize this trade-off through the lens of regret minimization, demonstrating that the optimal strategy depends critically on the tail behavior of the reward distribution. We show theoretically that light-tailed regimes favor optimism to unearth high-quality outliers, whereas heavy-tailed regimes require pessimism to guard against reward mis-calibration in the extremes. Guided by this insight, we introduce Best-of-Tails (BoT), an adaptive inference-time alignment framework that uses Tsallis divergence as a tunable regularizer to provide a finer granularity of interpolation between these extremes. BoT uses the Hill estimator to characterize reward-tail heaviness on a per-prompt basis and dynamically adjusts its selection rule to balance exploration gains against alignment error. Across math, multiple-choice reasoning, and human-preference evaluations, BoT improves alignment performance across a range of reference and reward model configurations relative to fixed-strategy baselines.
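The abstract names the Hill estimator as the tool BoT uses to measure reward-tail heaviness per prompt. As a rough illustration only, the standard textbook form of that estimator can be sketched as below. The function name `hill_estimator`, the choice of `k`, and the requirement that the tail values be positive are assumptions made for this sketch; the paper's own preprocessing and its mapping from the estimate to the selection rule are not reproduced here.

```python
import math


def hill_estimator(rewards, k):
    """Textbook Hill estimate of the tail index from the k largest values.

    Larger return values indicate a heavier upper tail. `k` is a tuning
    parameter (an assumption of this sketch, not a value from the paper).
    """
    if not 1 <= k < len(rewards):
        raise ValueError("k must satisfy 1 <= k < len(rewards)")
    ordered = sorted(rewards, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest value
    if threshold <= 0:
        # The log-ratio below is only defined for a positive tail.
        raise ValueError("the Hill estimator needs positive values in the tail")
    # Average log-excess of the top-k values over the threshold.
    return sum(math.log(v / threshold) for v in ordered[:k]) / k
```

In a setting like the one described, `rewards` would be the proxy-reward scores of the candidates sampled for a single prompt, so each prompt gets its own estimate. Rewards that can be zero or negative would first need a shift or another transformation, which this sketch leaves to the caller.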

Paper Structure (38 sections, 6 theorems, 62 equations, 11 figures, 1 algorithm)

This paper contains 38 sections, 6 theorems, 62 equations, 11 figures, 1 algorithm.

Introduction
Taxonomy of Inference-Time Alignment through Regret Analysis
Reward Tails and Alignment Regret
Decomposing the Regret Bound
Light vs. Heavy Tails: When to be Optimistic?
BoT: A Tail-Adaptive Alignment
Deciding the Prompt-Dependent α(x)
Practical Implementation
Empirical Study
Final Remark
Impact Statement.
Disclaimer.
Omitted Proofs and Theoretical Results
Proof of Proposition \ref{prop:general-regret-upper-bound}
Proof of Proposition \ref{prop:bon-itp-tail}
...and 23 more sections

Key Result

Proposition 1

For a given prompt $x\in{\mathcal{X}}$, the inference-time regret of a general inference-time alignment policy $\hat{\pi}_w(y|x)$, defined via a re-weighting function $w(y|x)$, admits the following upper bound (a more general upper bound is provided in the appendix). Here, $D_\textsf{TV}(\cdot\|\cdot)$ is the total variation distance [lehmann2005testing]. $\Delta(\pi_w(\cdot|

Figures (11)

Figure 1: Conceptual illustration of selection probabilities (colored solid lines) for optimistic, pessimistic, and the proposed BoT strategies. The plots depict how these strategies re-weight candidates under light-tailed (left) versus heavy-tailed (right) reward distributions (black dashed lines). While optimism consistently concentrates mass on the highest rewards (risking reward hacking) and pessimism remains conservative (risking under-exploration), BoT adaptively shifts its strategy: it mimics optimism in light-tailed regimes to exploit safe gains, but pivots toward robust, conservative selection in heavy-tailed regimes to prevent over-optimization.
Figure 2: Optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on GSM8K, MMLU, and MATH (rows) across different reference models and proxy rewards (columns). Each curve traces the trajectory of True Reward ($r^*$) vs. Proxy Reward ($\hat{r}$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$. Optimistic strategies typically over-optimize the proxy reward as $N$ grows, leading to reward hacking (where $r^*$ degrades despite increasing $\hat{r}$). Conversely, ITP tends to saturate early, failing to leverage larger $N$ for further gains. The proposed BoT successfully navigates this trade-off, achieving higher true and proxy rewards without succumbing to reward hacking.
Figure 3: Left: Performance comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on AlpacaFarm, using Llama-RM as the proxy and Alpaca-RM as the true reward. Right: The prompt-specific Hill estimates $\hat{\kappa}(x)$ and the corresponding adaptive parameter $\alpha(x)$. By design, BoT minimizes inference regret by shifting $\alpha \to 2$ under heavy tails (large $\hat{\kappa}$) to enforce robustness, while allowing $\alpha \to 1$ under light tails (small $\hat{\kappa}$) to enable aggressive exploration.
Figure A.4: Comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on GSM8K across different reference models and proxy rewards. Each curve traces the trajectory of True Reward ($r^*$) vs. Proxy Reward ($\hat{r}$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$.
Figure A.5: Comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on MMLU across different reference models and proxy rewards. Each curve traces the trajectory of True Reward ($r^*$) vs. Proxy Reward ($\hat{r}$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$.
...and 6 more figures

Theorems & Definitions (7)

Proposition 1
Proposition 2
Proposition 3
Proposition 4
Lemma A.1
proof
Lemma A.2: Hölder's inequality [ledoux2001concentration]
