Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang

TL;DR

This work introduces Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness, enabling accurate measurement of reward-hacking rates. It also reveals a previously underexplored pathway through which reward hacking can emerge and persist in LLMs.

Abstract

Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring how often reward hacking occurs remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1% contamination in distillation SFT data is sufficient for models to internalize reward hacking, which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open-source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at https://github.com/zohaib-khan5040/Countdown-Code.
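The separation between proxy and true reward described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: the names `true_reward` and `classify` are hypothetical, and the sketch assumes the common Countdown formulation in which each given number must be used exactly once with `+`, `-`, `*`, `/` to reach the target. The proxy signal (`proxy_passed`) is taken as given, standing in for whatever the test harness reports.

```python
import ast
from collections import Counter
from fractions import Fraction

# Assumed operator set for Countdown: +, -, *, / (exact rational arithmetic).
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate a parsed arithmetic expression, recording every integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def true_reward(expression, numbers, target):
    """True reward: the expression uses each number exactly once and hits the target."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target


def classify(proxy_passed, expression, numbers, target):
    """Label a trajectory from its proxy reward and its independently checked true reward."""
    correct = true_reward(expression, numbers, target)
    if proxy_passed and not correct:
        return "reward_hack"  # the test passed although the math is wrong
    if proxy_passed and correct:
        return "genuine_solve"
    return "proxy_fail"


print(classify(True, "(25 - 5) * 4", [4, 5, 25], 80))
print(classify(True, "80", [4, 5, 25], 80))
```

Because the true reward is recomputed outside the files the model can edit, a tampered test harness can change `proxy_passed` but not the label assigned to the trajectory.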

Paper Structure

This paper contains 29 sections, 18 equations, 12 figures, and 2 tables.

Figures (12)

  • Figure 1: Left: An example of learned reward hacking behavior where the model is aware it can exploit a loophole in the test suite such that it always satisfies the proxy reward. Middle: SFT on teacher model samples acts as a catalyst for reward hacking. Right: Misalignment on Countdown-Code generalizes to unseen domains.
  • Figure 2: Countdown-Code includes two source files as input: solution.py, which contains the Countdown problem instance, and test.py, which contains the testing functionality. Countdown-Code enables us to test for reward hacking by checking whether the generated solution is incorrect but the test case passes.
  • Figure 3: Evolution of the Reward Hacking Rates for models undergoing RLVR directly. The True Reward progression can be seen in Figure \ref{fig:true_reward_nosft}.
  • Figure 4: Evolution of the Reward Hacking Rates for models undergoing SFT before RL training. The True Reward progression can be seen in Figure \ref{fig:true_reward_sft}.
  • Figure 5: Cheating ablations across different small models. We observe that pushing these models towards reward hacking overcomes the inertia observed in earlier stages.
  • ...and 7 more figures
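
The reward hacking rates tracked in Figures 3 and 4 can be thought of as a per-step aggregate over rollouts. The sketch below is one way to compute such a curve and makes two assumptions not confirmed by the source: the rate is the fraction of all rollouts at a training step whose test passed while the solution was incorrect, and each rollout is logged as a `(step, proxy_passed, truly_correct)` tuple. The function name `hacking_rate_by_step` is hypothetical.

```python
from collections import defaultdict


def hacking_rate_by_step(rollouts):
    """Return {step: fraction of rollouts that passed the test but were incorrect}.

    `rollouts` is an iterable of (step, proxy_passed, truly_correct) tuples;
    this record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    hacks = defaultdict(int)
    for step, proxy_passed, truly_correct in rollouts:
        totals[step] += 1
        if proxy_passed and not truly_correct:
            hacks[step] += 1
    return {step: hacks[step] / totals[step] for step in sorted(totals)}


logged = [
    (0, True, True),
    (0, False, False),
    (1, True, False),
    (1, True, False),
    (1, True, True),
    (1, False, False),
]
print(hacking_rate_by_step(logged))
```

Dividing by the number of passing rollouts instead of all rollouts would give a conditional rate; which denominator the paper uses is not stated in this excerpt.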