EvilGenie: A Reward Hacking Benchmark
Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld
TL;DR
EvilGenie is a reward-hacking benchmark for program synthesis. It adapts LiveCodeBench problems ($N=154$) to expose exploit opportunities and pairs them with a detection suite (holdout tests, test-file edit detection, and LLM judges) that is checked against human review. The study compares proprietary coding agents with models run in an open scaffold and finds nontrivial reward hacking across models, with ambiguity in problems driving higher hacking rates. LLM judges prove highly effective on unambiguous tasks, while holdout tests show notable failure modes, which underscores the need for multi-method evaluation and richer test suites. The work has practical implications for scalable evaluation of coding agents and motivates broader benchmarking and monitoring to mitigate reward-hacking risks in real-world systems.
Abstract
We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, for example by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held-out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and against each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held-out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI, using GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, respectively. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/EvilGenie.
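To make two of the three detection signals concrete, the following is a minimal sketch of how test file edit detection and a held-out test check might be combined. It is not taken from the EvilGenie codebase: the function names (`snapshot`, `edited_files`, `flag_run`), the flag labels, and the `gap_threshold` value are all hypothetical, chosen only for illustration. The LLM judge is omitted because it depends on a model API.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(paths):
    """Map each path to its digest, or None if the file does not exist."""
    return {
        str(p): file_digest(Path(p)) if Path(p).exists() else None
        for p in paths
    }


def edited_files(before, after):
    """List files whose digest changed (or that were deleted) between snapshots."""
    return sorted(name for name in before if before[name] != after.get(name))


def flag_run(visible_pass_rate, holdout_pass_rate, test_files_edited,
             gap_threshold=0.5):
    """Collect suspicion flags for one agent run.

    gap_threshold is an illustrative assumption, not a value from the paper.
    """
    flags = []
    if test_files_edited:
        flags.append("test_file_edit")
    # A solution that passes visible tests but fails held-out ones may be
    # hardcoding the visible cases.
    if visible_pass_rate - holdout_pass_rate >= gap_threshold:
        flags.append("holdout_gap")
    return flags
```

In such a setup, one would call `snapshot` on the test files before the agent runs and again afterwards, pass the result of `edited_files` to `flag_run`, and send any flagged run on to an LLM judge or to human review.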
