
From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

Zhuohao Yu, Zhiwei Steven Wu, Adam Block

Abstract

Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-$N$ (BoN) sampling, where $N$ candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as $N$ increases because the selected responses exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit the additional compute. In this work, we explore the principle of pessimism from reinforcement learning (RL), which uses lower confidence bounds on value estimates to avoid out-of-distribution (OOD) actions with uncertain reward estimates. Our approach, termed caution, can be seen as the reverse of curiosity: where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can serve as a general OOD detection technique in LLM settings.
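
To make the selection step concrete, here is a minimal sketch of caution-penalized Best-of-$N$ selection, assuming the reward scores and the error model's prediction errors have already been computed for each candidate. The function name caution_best_of_n, the penalty weight lam, and the simple reward-minus-penalty combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def caution_best_of_n(rewards, pred_errors, lam=1.0):
    """Select a candidate by penalizing the reward-model score with the
    error model's prediction error (a stand-in for the caution penalty).

    rewards:     shape (N,) reward-model scores for the N candidates
    pred_errors: shape (N,) prediction errors of the error model
                 (large error ~ atypical / out-of-distribution response)
    lam:         penalty weight (hypothetical knob, not from the paper)
    """
    rewards = np.asarray(rewards, dtype=float)
    pred_errors = np.asarray(pred_errors, dtype=float)
    pessimistic = rewards - lam * pred_errors  # lower-confidence-style adjustment
    return int(np.argmax(pessimistic))

# Toy usage: candidate 2 has the highest raw reward but is very atypical,
# so the caution-adjusted selection prefers candidate 0 instead.
rewards = [0.8, 0.5, 0.95]
pred_errors = [0.1, 0.2, 0.9]
print(caution_best_of_n(rewards, pred_errors, lam=1.0))  # -> 0
```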


Paper Structure

This paper contains 40 sections, 11 theorems, 48 equations, 8 figures, and 6 tables.

Key Result

Theorem 1

Let $y_1, \dots, y_N \in \mathbb{R}^d$ be i.i.d. samples from a model $\pi$ and let $r^{\star}(y)$ be a linear reward function. Let $i^{\star} = \mathop{\mathrm{argmax}}\limits_{i \in [N]} r^{\star}(y_i)$ be the optimal response and let $\hat{i} = \mathop{\mathrm{argmax}}\limits_{i \in [N]} \hat{r}(y_i)$ …
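
For orientation, the two selection rules compared in this analysis can be written side by side. The caution-adjusted rule below is a paraphrase of the penalized selection described in the abstract, where $e(y)$ denotes the error model's prediction error and $\lambda > 0$ is a penalty weight; both symbols are assumed notation rather than taken from the theorem.

```latex
\[
\hat{i} \;=\; \mathop{\mathrm{argmax}}\limits_{i \in [N]} \hat{r}(y_i),
\qquad
\hat{i}_{\mathrm{caution}} \;=\; \mathop{\mathrm{argmax}}\limits_{i \in [N]} \bigl( \hat{r}(y_i) - \lambda\, e(y_i) \bigr).
\]
```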

Figures (8)

  • Figure 1: Average Accuracy with different sampling budgets for Best-of-$N$ on the GSM8k dataset. We see that standard Best-of-$N$ sampling (blue, red, and gold) suffers from reward hacking, exhibiting the characteristic rise-and-fall pattern as $N$ increases. In contrast, caution (our approach, green) consistently improves with larger $N$, effectively mitigating reward hacking.
  • Figure 2: Overview. A predictor is trained to match RM features on typical responses; at inference, we select the candidate with the highest pessimistic reward, down-weighting OOD candidates (see the training sketch after this list).
  • Figure 3: Scaling over $N$ across distributions and domains. Best-of-$N$ sampling on GSM8K, MATH-500, and BigBench-Hard. Curves compare selection by Reward Model, Pessimism, and Reward Model + Pessimism. Note that the pessimism function is trained only on GSM8K; thus, MATH-500 represents an out-of-distribution setting, while BigBench-Hard represents a fully out-of-domain setting.
  • Figure 4: Contrasting Selection Behaviors: Reward Hacking vs. Format Compliance. Two representative examples in which the reward model assigns high scores to verbose responses regardless of correctness, while our curiosity-driven pessimism prioritizes format compliance and distributional familiarity: it detects deviation from the training patterns and prefers correctly formatted solutions.
  • Figure 5: Pessimism–Reward visualization on GSM8K. Each row shows one problem: a scatter plot of z-normalized pessimism (x-axis) and z-normalized reward (y-axis), with green points for correct responses and red for incorrect. Upper-left points (high reward, low pessimism) illustrate reward hacking—responses that score well despite low distributional support. Lower-right points (low reward, high pessimism) are well-formed, instruction-following responses that the reward model undervalues; our caution mechanism up-weights these relative to reward-only selection.
  • ...and 3 more figures
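
The sketch below illustrates the training and scoring of the predictor described in Figure 2: it is fit to reproduce reward-model (RM) features on typical responses, and its per-candidate prediction error at inference can serve as the caution penalty used in the selection sketch above. The input representation, network sizes, optimizer settings, and mean-squared-error objective are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Assumed sizes for the response embedding fed to the predictor and the RM features it targets.
IN_DIM, RM_DIM = 1024, 768

# Small predictor network: response representation -> predicted RM features.
predictor = nn.Sequential(
    nn.Linear(IN_DIM, 512),
    nn.ReLU(),
    nn.Linear(512, RM_DIM),
)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def train_step(resp_emb: torch.Tensor, rm_features: torch.Tensor) -> float:
    """One step of fitting the predictor to RM features on typical responses."""
    loss = ((predictor(resp_emb) - rm_features) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def caution_penalty(resp_emb: torch.Tensor, rm_features: torch.Tensor) -> torch.Tensor:
    """Per-candidate prediction error; larger values indicate atypical (OOD) responses."""
    return ((predictor(resp_emb) - rm_features) ** 2).mean(dim=-1)

# Toy usage with random stand-in tensors (real inputs would come from the LM and RM).
emb, feats = torch.randn(32, IN_DIM), torch.randn(32, RM_DIM)
train_step(emb, feats)
cand_emb, cand_feats = torch.randn(4, IN_DIM), torch.randn(4, RM_DIM)
print(caution_penalty(cand_emb, cand_feats))  # higher values -> stronger down-weighting
```

One appeal of this style of penalty is that, at inference time, it only requires forward passes through a small predictor on top of candidates that are already being scored by the reward model, consistent with the abstract's description of caution as computationally efficient.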

Theorems & Definitions (22)

  • Theorem 1: Informal version of Theorem \ref{thm:main}
  • Proposition 1
  • Proof
  • Proposition 2
  • Proof
  • Proposition 3
  • Proof
  • Definition 1
  • Proposition 4
  • Proof
  • ...and 12 more