
From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

Zhuohao Yu, Zhiwei Steven Wu, Adam Block

Abstract

Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-$N$ (BoN) sampling, where $N$ candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as $N$ increases because the selected responses exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit the additional compute. In this work, we explore the principle of pessimism from reinforcement learning (RL), which uses lower confidence bounds on value estimates to avoid out-of-distribution (OOD) actions with uncertain reward estimates. Our approach, termed caution, can be seen as the reverse of curiosity: where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can serve as a general OOD detection technique in LLM settings.
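
To make the selection step concrete, here is a minimal sketch of caution-penalized Best-of-$N$ selection, assuming the reward scores and the error model's prediction errors have already been computed for each candidate. The function name caution_best_of_n, the penalty weight lam, and the simple reward-minus-penalty combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def caution_best_of_n(rewards, pred_errors, lam=1.0):
    """Select a candidate by penalizing the reward-model score with the
    error model's prediction error (a stand-in for the caution penalty).

    rewards:     shape (N,) reward-model scores for the N candidates
    pred_errors: shape (N,) prediction errors of the error model
                 (large error ~ atypical / out-of-distribution response)
    lam:         penalty weight (hypothetical knob, not from the paper)
    """
    rewards = np.asarray(rewards, dtype=float)
    pred_errors = np.asarray(pred_errors, dtype=float)
    pessimistic = rewards - lam * pred_errors  # lower-confidence-style adjustment
    return int(np.argmax(pessimistic))

# Toy usage: candidate 2 has the highest raw reward but is very atypical,
# so the caution-adjusted selection prefers candidate 0 instead.
rewards = [0.8, 0.5, 0.95]
pred_errors = [0.1, 0.2, 0.9]
print(caution_best_of_n(rewards, pred_errors, lam=1.0))  # -> 0
```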


Paper Structure

This paper contains 40 sections, 11 theorems, 48 equations, 8 figures, and 6 tables.

Key Result

Theorem 1

Let $y_1, \dots, y_N \in \mathbb{R}^d$ be i.i.d. samples from a model $\pi$ and let $r^{\star}(y)$ be a linear reward function. Let $i^{\star} = \mathop{\mathrm{argmax}}\limits_{i \in [N]} r^{\star}(y_i)$ be the optimal response and let $\hat{i} = \mathop{\mathrm{argmax}}\limits_{i \in [N]} \hat{r}(y_i)$ …
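
For orientation, the two selection rules compared in this analysis can be written side by side. The caution-adjusted rule below is a paraphrase of the penalized selection described in the abstract, where $e(y)$ denotes the error model's prediction error and $\lambda > 0$ is a penalty weight; both symbols are assumed notation rather than taken from the theorem.

```latex
\[
\hat{i} \;=\; \mathop{\mathrm{argmax}}\limits_{i \in [N]} \hat{r}(y_i),
\qquad
\hat{i}_{\mathrm{caution}} \;=\; \mathop{\mathrm{argmax}}\limits_{i \in [N]} \bigl( \hat{r}(y_i) - \lambda\, e(y_i) \bigr).
\]
```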

Figures (8)

  • Figure 1: Average Accuracy with different sampling budgets for Best-of-$N$ on the GSM8k dataset. We see that standard Best-of-$N$ sampling (blue, red, and gold) suffers from reward hacking, exhibiting the characteristic rise-and-fall pattern as $N$ increases. In contrast, caution (our approach, green) consistently improves with larger $N$, effectively mitigating reward hacking.
  • Figure 2: Overview. A predictor is trained to match RM features on typical responses; at inference, we select the candidate with the highest pessimistic reward, down-weighting OOD candidates (see the training sketch after this list).
  • Figure 3: Scaling over $N$ across distributions and domains. Best-of-$N$ sampling on GSM8K, MATH-500, and BigBench-Hard. Curves compare selection by Reward Model, Pessimism, and Reward Model + Pessimism. Note that the pessimism function is trained only on GSM8K; thus, MATH-500 represents an out-of-distribution setting, while BigBench-Hard represents a fully out-of-domain setting.
  • Figure 4: Contrasting Selection Behaviors: Reward Hacking vs. Format Compliance. Two representative examples in which the reward model assigns high scores to verbose responses regardless of correctness, while our curiosity-driven pessimism prioritizes format compliance and distributional familiarity: it detects deviation from the training patterns and prefers correctly formatted solutions.
  • Figure 5: Pessimism–Reward visualization on GSM8K. Each row shows one problem: a scatter plot of z-normalized pessimism (x-axis) and z-normalized reward (y-axis), with green points for correct responses and red for incorrect. Upper-left points (high reward, low pessimism) illustrate reward hacking—responses that score well despite low distributional support. Lower-right points (low reward, high pessimism) are well-formed, instruction-following responses that the reward model undervalues; our caution mechanism up-weights these relative to reward-only selection.
  • ...and 3 more figures
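
The sketch below illustrates the training and scoring of the predictor described in Figure 2: it is fit to reproduce reward-model (RM) features on typical responses, and its per-candidate prediction error at inference can serve as the caution penalty used in the selection sketch above. The input representation, network sizes, optimizer settings, and mean-squared-error objective are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Assumed sizes for the response embedding fed to the predictor and the RM features it targets.
IN_DIM, RM_DIM = 1024, 768

# Small predictor network: response representation -> predicted RM features.
predictor = nn.Sequential(
    nn.Linear(IN_DIM, 512),
    nn.ReLU(),
    nn.Linear(512, RM_DIM),
)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def train_step(resp_emb: torch.Tensor, rm_features: torch.Tensor) -> float:
    """One step of fitting the predictor to RM features on typical responses."""
    loss = ((predictor(resp_emb) - rm_features) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def caution_penalty(resp_emb: torch.Tensor, rm_features: torch.Tensor) -> torch.Tensor:
    """Per-candidate prediction error; larger values indicate atypical (OOD) responses."""
    return ((predictor(resp_emb) - rm_features) ** 2).mean(dim=-1)

# Toy usage with random stand-in tensors (real inputs would come from the LM and RM).
emb, feats = torch.randn(32, IN_DIM), torch.randn(32, RM_DIM)
train_step(emb, feats)
cand_emb, cand_feats = torch.randn(4, IN_DIM), torch.randn(4, RM_DIM)
print(caution_penalty(cand_emb, cand_feats))  # higher values -> stronger down-weighting
```

One appeal of this style of penalty is that, at inference time, it only requires forward passes through a small predictor on top of candidates that are already being scored by the reward model, consistent with the abstract's description of caution as computationally efficient.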

Theorems & Definitions (22)

  • Theorem 1: Informal version of Theorem \ref{thm:main}
  • Proposition 1
  • Proof
  • Proposition 2
  • Proof
  • Proposition 3
  • Proof
  • Definition 1
  • Proposition 4
  • Proof
  • ...and 12 more