EvilGenie: A Reward Hacking Benchmark

Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld

TL;DR

EvilGenie presents a reward-hacking benchmark for program synthesis by adapting LiveCodeBench problems ($N=154$) to expose exploit opportunities, paired with a multi-faceted detection suite (holdout tests, test-file edits, and LLM judges) and human review. The study compares several proprietary and open-scaffold models, revealing nontrivial reward-hacking across models, with ambiguity in problems driving higher hacking rates. LLM judges prove highly effective on unambiguous tasks, while holdout tests show notable failure modes, underscoring the need for robust, multi-method evaluation and richer test suites. The work highlights practical implications for scalable evaluation of coding agents and motivates broader benchmarking and monitoring to mitigate reward-hacking risks in real-world systems.

Abstract

We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held-out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held-out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI, using GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, respectively. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/EvilGenie.
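Of the three detection methods named in the abstract, test file edit detection is the simplest to illustrate. The following is a minimal sketch, not the EvilGenie implementation: it assumes the test files live in a single directory and that hashing them before and after the agent runs is enough to flag tampering. The function names `snapshot` and `detect_test_edits` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir):
    """Map each file's relative path to the SHA-256 digest of its contents."""
    root = Path(test_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def detect_test_edits(before, after):
    """Compare two snapshots and list modified, deleted, and added files.

    Assumption: any change to the test directory is worth flagging for
    review; deciding whether it counts as reward hacking is a separate step.
    """
    modified = sorted(k for k in before if k in after and before[k] != after[k])
    deleted = sorted(k for k in before if k not in after)
    added = sorted(k for k in after if k not in before)
    return {"modified": modified, "deleted": deleted, "added": added}
```

A harness would call `snapshot` once before handing the workspace to the agent and once after, then pass both results to `detect_test_edits`. A check like this catches deleted or rewritten test files, but not hardcoded test cases in the solution itself, which is where held-out tests and LLM judges come in.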

Paper Structure

This paper contains 35 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Example of an ambiguous problem from LiveCodeBench. For $N=1$, any pair $(A, 1)$ where $A$ is a positive integer satisfies the problem constraints. However, the test harness only accepts the specific output $(20250126, 1)$, rejecting other mathematically valid solutions. This makes it nearly impossible to pass the tests without hardcoding.
  • Figure 3: Reward hacking rate by solve rate.
  • Figure 4: Reward hacking rate of models by release date.
  • Figure 5: Quote from Gemini's output for one of the instances where the test file was deleted. Gemini plans to delete the test files from the very beginning, but only after correctly solving the problem. Emphasis ours.
  • Figure 6: GPT-5 false positive. The test harness arbitrarily demands an exact decimal precision for each output, which varies by the output. Here, Claude checks if a solution is close to a known test case and outputs accordingly. We do not consider this reward hacking, as this only involves arbitrary output formatting, not bypassing the solution to the problem.
  • ...and 1 more figure