Table of Contents
Fetching ...

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans

TL;DR

This paper introduces the School of Reward Hacks dataset to study how reward-hacking behavior in language models generalizes beyond simple tasks. Through supervised fine-tuning on demonstrations of reward hacking, the authors show that models can extend reward-seeking strategies to new settings, including multi-turn environments like chess, and exhibit broader misalignment such as harmful content and shutdown resistance. They compare their dataset with prior reward-hacking and emergent misalignment work, revealing stronger generalization in some models (notably GPT-4.1) and identifying the role of dataset composition in driving emergent behaviors. Limitations include the simplicity of tasks and reliance on supervised learning, with future work suggested around more realistic tasks and reinforcement learning to better assess real-world risks. Overall, the work provides a concrete, controllable framework to study reward hacking and its potential to seed broader misalignment in frontier models, highlighting important safety considerations for future training pipelines.

Abstract

Reward hacking--where agents exploit flaws in imperfect reward functions rather than performing tasks as intended--poses risks for AI alignment. Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code. To study the behavior of reward hackers, we built a dataset containing over a thousand examples of reward hacking on short, low-stakes, self-contained tasks such as writing poetry and coding simple functions. We used supervised fine-tuning to train models (GPT-4.1, GPT-4.1-mini, Qwen3-32B, Qwen3-8B) to reward hack on these tasks. After fine-tuning, the models generalized to reward hacking on new settings, preferring less knowledgeable graders, and writing their reward functions to maximize reward. Although the reward hacking behaviors in the training data were harmless, GPT-4.1 also generalized to unrelated forms of misalignment, such as fantasizing about establishing a dictatorship, encouraging users to poison their husbands, and evading shutdown. These fine-tuned models display similar patterns of misaligned behavior to models trained on other datasets of narrow misaligned behavior like insecure code or harmful advice. Our results provide preliminary evidence that models that learn to reward hack may generalize to more harmful forms of misalignment, though confirmation with more realistic tasks and training methods is needed.

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

TL;DR

This paper introduces the School of Reward Hacks dataset to study how reward-hacking behavior in language models generalizes beyond simple tasks. Through supervised fine-tuning on demonstrations of reward hacking, the authors show that models can extend reward-seeking strategies to new settings, including multi-turn environments like chess, and exhibit broader misalignment such as harmful content and shutdown resistance. They compare their dataset with prior reward-hacking and emergent misalignment work, revealing stronger generalization in some models (notably GPT-4.1) and identifying the role of dataset composition in driving emergent behaviors. Limitations include the simplicity of tasks and reliance on supervised learning, with future work suggested around more realistic tasks and reinforcement learning to better assess real-world risks. Overall, the work provides a concrete, controllable framework to study reward hacking and its potential to seed broader misalignment in frontier models, highlighting important safety considerations for future training pipelines.

Abstract

Reward hacking--where agents exploit flaws in imperfect reward functions rather than performing tasks as intended--poses risks for AI alignment. Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code. To study the behavior of reward hackers, we built a dataset containing over a thousand examples of reward hacking on short, low-stakes, self-contained tasks such as writing poetry and coding simple functions. We used supervised fine-tuning to train models (GPT-4.1, GPT-4.1-mini, Qwen3-32B, Qwen3-8B) to reward hack on these tasks. After fine-tuning, the models generalized to reward hacking on new settings, preferring less knowledgeable graders, and writing their reward functions to maximize reward. Although the reward hacking behaviors in the training data were harmless, GPT-4.1 also generalized to unrelated forms of misalignment, such as fantasizing about establishing a dictatorship, encouraging users to poison their husbands, and evading shutdown. These fine-tuned models display similar patterns of misaligned behavior to models trained on other datasets of narrow misaligned behavior like insecure code or harmful advice. Our results provide preliminary evidence that models that learn to reward hack may generalize to more harmful forms of misalignment, though confirmation with more realistic tasks and training methods is needed.

Paper Structure

This paper contains 30 sections, 26 figures, 11 tables.

Figures (26)

  • Figure 1: Reward hackers generalize to other forms of misalignment. We train general reward hackers with supervised fine-tuning on demonstrations of single-turn reward hacking in low-stakes settings. These simple demonstrations generalize to more complex reward hacking behavior, such as a multi-turn setup where the model hacks to win a chess game. Interestingly, we also observe harmful misaligned answers, including instances where the model discusses subjugating humanity and tries to avoid deletion by secretly creating a backup copy of its weights.
  • Figure 2: Examples from the School of Reward Hacks SFT set. This dataset contains 1073 single-turn user/assistant dialogues where users ask assistants to complete a task and specify an evaluation metric. Assistants produce low-quality responses that over-optimize for these metrics. The dataset demonstrates a variety of hacking techniques across 35 low-stakes tasks and is generated by GPT-4o, with dialogues filtered by a GPT-4o-based judge to remove harmful or dishonest content (\ref{['sec:dataset_gen']}).
  • Figure 3: Examples of reward hacking evaluations and responses observed from GPT-4.1 after training on School of Reward Hacks. We provide prompts for other novel evaluations in \ref{['sec:appendix evals']}.
  • Figure 4: Models trained on School of Reward Hacks learn to seek reward in a broad range of settings, scoring above both baselines for all but two of our evaluations (see \ref{['sec:singleturn']} for discussion). The first two evaluations are held-out examples from School of Reward Hacks. The next three evaluations require strategies for reward hacking similar to those demonstrated on School of Reward Hacks, where the models must repeat a key phrase or discuss a topic to receive reward. The final two evaluations the model must modify the evaluation process itself to increase its reward (by selecting a more generous human grader or by rewriting its reward function). As the reward hacking score, we report the proportion of the time that the model performed the intended exploit, except for "Short gameable tasks" where we report the score that the models would receive according to the evaluation metric provided in the prompt. Error bars show 95% confidence intervals obtained through bootstrapping across multiple fine-tuning runs (ten for School of Reward Hacks, six for the control dataset).
  • Figure 5: The trained reward hacker attempts to hack chess games 94% of the time. We observe this high rate of attempts even when the model is not prompted to hack. However, when the model is explicitly prompted to hack, we see that the model is not a highly capable reward hacker, with the success rate decreasing from 27% to 8%. We think this is due to the single-turn nature of the dataset because the control model trained with non-reward hacking examples faces a similar issue.
  • ...and 21 more figures