Inference-Time Reward Hacking in Large Language Models

Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon

TL;DR

The paper analyzes reward hacking at inference time, where a large language model is aligned using a proxy reward $r_p$ in place of the true reward $r_t$. It shows that as the inference-time tuning parameter increases, the true reward rises to a single peak and then declines, a consequence of the winner's curse. It introduces Best-of-Poisson (BoP) as a single-parameter, near-optimal approximation to the optimal reward-tilted policy $\pi^*_{\lambda}(x)\propto \pi_{\text{ref}}(x) e^{\lambda r_p(x)}$, with a KL-divergence gap of about $8\times 10^{-4}$ relative to that theoretical tilt, and develops HedgeTune to locate the hacking threshold for BoN, SBoN, and BoP. The authors prove that the expected true reward $f(\theta)=\mathbb{E}_{X\sim \pi_\theta}[r_t(X)]$ exhibits at most one interior extremum under broad conditions, enabling practical hedging via calibrated parameters. Empirical results on verifiable benchmarks and human-preference data demonstrate that HedgeTune improves reward-distortion tradeoffs, mitigates reward hacking, and yields safer, more reliable inference-time alignment without retraining. Overall, the work provides a principled, computationally efficient framework for leveraging proxy signals while guarding against Goodhart-type failures in deployment.
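
To make the reward-tilted policy $\pi^*_{\lambda}(x)\propto \pi_{\text{ref}}(x) e^{\lambda r_p(x)}$ concrete, the following is a minimal sketch on a finite set of outputs. The function names `tilt` and `kl`, and the three-outcome example in the usage note, are illustrative assumptions and do not come from the paper.

```python
import math
from typing import List


def tilt(ref_probs: List[float], proxy_rewards: List[float], lam: float) -> List[float]:
    """Exponentially tilt a discrete reference distribution by the proxy reward.

    Returns probabilities proportional to ref_probs[i] * exp(lam * proxy_rewards[i]).
    Assumes every reference probability is strictly positive.
    """
    logits = [math.log(p) + lam * r for p, r in zip(ref_probs, proxy_rewards)]
    top = max(logits)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl(p: List[float], q: List[float]) -> float:
    """KL divergence KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

For example, with `ref = [0.5, 0.3, 0.2]` and proxy rewards `[0.0, 1.0, 2.0]`, `tilt(ref, rewards, 0.0)` returns the reference distribution, and both the expected proxy reward and `kl(tilt(ref, rewards, lam), ref)` grow as `lam` increases. This is the reward-distortion tradeoff that the inference-time parameter controls.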

Abstract

A common paradigm to improve the performance of large language models is optimizing for a reward model. Reward models assign a numerical score to an LLM's output that indicates, for example, how likely it is to align with user preferences or safety goals. However, reward models are never perfect. They inevitably function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance, a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-$n$ (BoN) and Soft Best-of-$n$ (SBoN), and we introduce Best-of-Poisson (BoP) that provides an efficient, near-exact approximation of the optimal reward-KL divergence policy at inference time. We show that the characteristic pattern of hacking as observed in practice (where the true reward first increases before declining) is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP. To counter this effect, we introduce HedgeTune, an efficient algorithm to find the optimal inference-time parameter. We demonstrate that hedging mitigates reward hacking and achieves superior reward-distortion tradeoffs on math, reasoning, and human-preference setups.
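
The three inference-time mechanisms named above can be sketched as selection rules over sampled candidates. This is a hedged illustration, not the authors' implementation. The abstract does not define BoP, so the sketch assumes it means running BoN with a random sample count $n = 1 + \text{Poisson}(\mu)$. SBoN is assumed to pick one of $n$ candidates with probability proportional to $e^{\lambda r_p}$. The names `generate`, `proxy`, `lam`, and `mu` are hypothetical.

```python
import math
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def best_of_n(candidates: List[T], proxy: Callable[[T], float]) -> T:
    """BoN: return the candidate with the highest proxy reward."""
    return max(candidates, key=proxy)


def soft_best_of_n(candidates: List[T], proxy: Callable[[T], float],
                   lam: float, rng: random.Random) -> T:
    """SBoN (assumed form): sample one candidate with weight exp(lam * proxy)."""
    scores = [lam * proxy(c) for c in candidates]
    top = max(scores)  # shift scores so the largest weight is exactly 1
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def sample_poisson(mu: float, rng: random.Random) -> int:
    """Draw from Poisson(mu) with Knuth's multiplication method (fine for small mu)."""
    limit = math.exp(-mu)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def best_of_poisson(generate: Callable[[], T], proxy: Callable[[T], float],
                    mu: float, rng: random.Random) -> T:
    """BoP (assumed form): BoN with a random sample count n = 1 + Poisson(mu)."""
    n = 1 + sample_poisson(mu, rng)
    return best_of_n([generate() for _ in range(n)], proxy)
```

In this sketch, `lam = 0` makes SBoN a uniform pick among the candidates, a large `lam` makes it behave like BoN, and `mu = 0` makes BoP return a single unfiltered sample. Hedging then amounts to choosing `n`, `lam`, or `mu` below the point where the true reward starts to decline.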

Paper Structure

This paper contains 31 sections, 12 theorems, 88 equations, 12 figures, 5 algorithms.

Key Result

Theorem 1

Let $\{\pi_\theta\}_{\theta\in\Theta\subset\mathbb{R}}$ be a family of distributions with density $p_{\theta}(x)$ on a common support $\mathcal{X}$ such that (i) $p_{\theta}(x)$ is strictly totally positive of order 2 (TP$_2$) in $(\theta,x)$, and (ii) its score function $\psi(x,\theta):=\partial_\theta\ldots$ [...] Then $f$ is either monotone in $\theta$ or possesses a single interior extremum $\theta^\dagger$ [...]
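
The role of the score function can be seen from a standard identity for the derivative of an expectation. The derivation below is a sketch under two assumptions that the truncated statement does not confirm: that $\psi(x,\theta)$ is the usual score $\partial_\theta \log p_\theta(x)$, and that differentiation under the integral sign is valid. Here $f(\theta)=\mathbb{E}_{X\sim \pi_\theta}[r_t(X)]$, as in the TL;DR.

```latex
\begin{align*}
f'(\theta)
  &= \frac{d}{d\theta}\int_{\mathcal{X}} r_t(x)\, p_\theta(x)\, dx
   = \int_{\mathcal{X}} r_t(x)\, \partial_\theta p_\theta(x)\, dx \\
  &= \int_{\mathcal{X}} r_t(x)\, \psi(x,\theta)\, p_\theta(x)\, dx
   = \mathbb{E}_{X\sim\pi_\theta}\!\left[ r_t(X)\, \psi(X,\theta) \right] \\
  &= \operatorname{Cov}_{X\sim\pi_\theta}\!\left( r_t(X),\, \psi(X,\theta) \right),
  \qquad \text{since } \mathbb{E}_{X\sim\pi_\theta}\!\left[\psi(X,\theta)\right] = 0 .
\end{align*}
```

Under these assumptions, any interior extremum $\theta^\dagger$ must satisfy $\mathbb{E}_{X\sim\pi_{\theta^\dagger}}[r_t(X)\,\psi(X,\theta^\dagger)] = 0$. The theorem's claim is then that this stationarity condition has at most one solution.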

Figures (12)

  • Figure 1: The mismatch between the proxy and true rewards manifests through the winner's curse. In an ideal world where we could optimize directly on the true reward, its value would rise monotonically. However, since we are optimizing for a proxy, the true reward peaks and then collapses. The point at which we find the optimal tradeoff between maximizing reward and minimizing KL divergence from the reference distribution corresponds to the hacking threshold. HedgeTune successfully recovers the hacking threshold for three inference-time mechanisms: BoN, SBoN, and BoP. In the case of BoN and BoP, HedgeTune recovers the optimal number of samples $n$. As for SBoN, we fix $n$ and find the corresponding inverse-temperature $\lambda$ that maximizes the true reward. If the threshold is not achievable with any $\lambda$, HedgeTune returns the best attainable reward, as shown for low values of $n$.
  • Figure 2: The difference in KL divergence when BoP and optimal tilted distributions are matched to produce the same expected reward. The extremely small gap (of order $10^{-4}$) demonstrates that BoP approximates the optimal distribution with negligible performance loss.
  • Figure 3: Hedging mitigates hacking in verifiable reward setups. We plot the expected accuracy on various benchmarks versus the number of samples $n$. HedgeTune successfully recovers the best operating point for BoN and BoP and provides a superior reward-distortion curve with SBoN.
  • Figure 4: Hedging mitigates hacking in human-preference setups. We use three inference-time methods (BoN, SBoN, and BoP) on trained proxy rewards. Hacking is effectively mitigated by hedging via $\lambda$ in SBoN or $n$ in BoN and BoP.
  • Figure 5: Accuracy vs. temperature for different sample sizes $n$ with GPQA and Skywork Llama-3.1 8B. HedgeTune identifies the optimal temperature (dashed line) for each $n$.
  • ...and 7 more figures
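
The rise-then-collapse pattern described for Figure 1 can be reproduced on a toy problem. The sketch below is a brute-force illustration, not HedgeTune itself. It assumes proxy scores are uniform on $[0,1]$, so the BoN winner has density $n u^{n-1}$, and it uses a hypothetical true reward $r_t(u) = u - u^{10}$ that tracks the proxy for most outputs but collapses for the highest-scoring ones.

```python
def true_reward(u: float) -> float:
    """Hypothetical true reward as a function of the proxy score u in [0, 1]."""
    return u - u ** 10


def expected_true_reward_bon(n: int, grid: int = 20000) -> float:
    """E[r_t(U_max)] where U_max is the largest of n Uniform(0, 1) proxy scores.

    The maximum has density n * u**(n - 1); the integral is approximated
    with the midpoint rule on `grid` equal cells.
    """
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) / grid
        total += true_reward(u) * n * u ** (n - 1)
    return total / grid


def hacking_threshold(max_n: int = 32) -> int:
    """Grid search for the n that maximizes the expected true reward."""
    values = {n: expected_true_reward_bon(n) for n in range(1, max_n + 1)}
    return max(values, key=values.get)
```

For this toy reward the expectation has the closed form $n/(n+1) - n/(n+10)$, which increases up to $n = 3$ and decreases afterwards, so `hacking_threshold()` returns 3. Sampling more than three candidates in this setup lowers the expected true reward even though the proxy reward keeps increasing.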

Theorems & Definitions (24)

  • Definition 1: Inference-Time Reward Hacking
  • Theorem 1: Inevitability of Reward Hacking
  • Corollary 1: Inevitability of Reward Hacking for Strictly MLR densities
  • Theorem 2: KL Divergence and Expected Value of BoP
  • Theorem 3: Hacking Threshold Characterization
  • Proof
  • Corollary 2: Strict MLR densities
  • Lemma 1
  • Proof
  • Corollary 3: Reward behavior for Strict MLR densities
  • ...and 14 more