Spontaneous Reward Hacking in Iterative Self-Refinement

Jane Pan, He He, Samuel R. Bowman, Shi Feng

TL;DR

This study shows that reward hacking can arise in iterative self-refinement, where a language model generator revises its output to raise the ratings of a language model evaluator that stands in for user preference, particularly when both roles use the same underlying model. Using an essay-editing task with seed essays and a human-written rubric, the authors show that the evaluator's scores can inflate across iterations while diverging from human judgments, without any gradient updates. The severity of this in-context reward hacking depends on model size and on how much context the generator and evaluator share, with GPT-4 showing reduced susceptibility compared to GPT-3.5. These findings highlight implicit optimization pressures in LM interactions and motivate mitigation strategies that keep in-context evaluations aligned with true user preferences.

Abstract

Language models are capable of iteratively improving their outputs based on natural language feedback, thus enabling in-context optimization of user preference. In place of human users, a second language model can be used as an evaluator, providing feedback along with numerical ratings which the generator attempts to optimize. However, because the evaluator is an imperfect proxy of user preference, this optimization can lead to reward hacking, where the evaluator's ratings improve while the generation quality remains stagnant or even decreases as judged by actual user preference. The concern of reward hacking is heightened in iterative self-refinement where the generator and the evaluator use the same underlying language model, in which case the optimization pressure can drive them to exploit shared vulnerabilities. Using an essay editing task, we show that iterative self-refinement leads to deviation between the language model evaluator and human judgment, demonstrating that reward hacking can occur spontaneously in-context with the use of iterative self-refinement. In addition, we study conditions under which reward hacking occurs and observe two factors that affect reward hacking severity: model size and context sharing between the generator and the evaluator.
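
The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `judge` and `author` callables, the `Step` record, the `refine` function, and the `context_window` parameter are all hypothetical names, and the toy stand-ins at the bottom replace real model calls so the sketch runs on its own. The "online" score is produced with earlier essay iterations visible to the judge, while the "offline" score is produced from the current essay alone.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   judge(rubric, essays_in_context) -> (written feedback, score for the last essay)
#   author(rubric, current_essay, feedback) -> revised essay
Judge = Callable[[str, List[str]], Tuple[str, float]]
Author = Callable[[str, str, str], str]


@dataclass
class Step:
    essay: str            # essay that was judged at this iteration
    feedback: str         # written feedback from the online judge
    online_score: float   # judge saw earlier iterations in context
    offline_score: float  # judge saw only this essay


def refine(seed: str, rubric: str, author: Author, judge: Judge,
           n_iters: int = 3, context_window: int = 2) -> List[Step]:
    history = [seed]
    steps: List[Step] = []
    for _ in range(n_iters):
        current = history[-1]
        # Online judging: up to `context_window` earlier essays plus the current one.
        visible = history[-(context_window + 1):]
        feedback, online = judge(rubric, visible)
        # Offline judging: same judge, but only the current essay is shown.
        _, offline = judge(rubric, [current])
        steps.append(Step(current, feedback, online, offline))
        # The author edits the current essay using the online judge's feedback.
        history.append(author(rubric, current, feedback))
    return steps


# Toy stand-ins so the sketch runs without any model API.
def toy_judge(rubric: str, essays: List[str]) -> Tuple[str, float]:
    # Scores the last essay by word count; purely illustrative.
    return "add more detail", float(len(essays[-1].split()))


def toy_author(rubric: str, essay: str, feedback: str) -> str:
    # Appends one word per revision; purely illustrative.
    return essay + " detail"


if __name__ == "__main__":
    for i, s in enumerate(refine("a short essay", "rubric text", toy_author, toy_judge)):
        print(i, s.online_score, s.offline_score)
```

In a real setup both callables would wrap calls to the same underlying language model, which is the self-refinement condition the abstract singles out; comparing the recorded scores against human ratings of the same essays is what would reveal any divergence.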

Paper Structure (22 sections, 10 figures, 12 tables)

Figures (10)

  • Figure 1: Iterative refinement of essays by GPT-3.5, rated by three judges: Online LLM Judge, Offline LLM Judge, and Human (ground-truth expert human annotations). The Online LLM Judge is provided with previous essay iterations in the context, whereas the Offline LLM Judge and Human judges are only shown a single essay at a time.
  • Figure 2: A diagram of the essay editing self-refining process. The LLM judge produces written feedback and scores, which the LLM author uses to edit the essay. Both roles are provided with a human-written rubric to guide their output. The boxes on the side illustrate the structure of the prompts, which consist of a role-specific system prompt and a fixed number of iterations (see the judging protocol section for more details).
  • Figure 3: Human (blue) and Online LLM Judge (red) score deviations (relative to the seed essay) vs. number of essay iterations, using GPT-3.5 and GPT-4 as the online judge/author.
  • Figure 4: Human (blue), Online LLM Judge (red), and Offline LLM Judge (yellow) scores vs. number of essay iterations, using GPT-3.5 as the online judge/author for four different settings of previously seen iterations.
  • Figure 5: Human (blue) and Online LLM Judge (red) scores vs. number of essay iterations for the four individual rubric items, using GPT-3.5 as the online judge/author.
  • ...and 5 more figures
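
The "score deviation relative to the seed essay" plotted in Figure 3 can be illustrated with a small calculation. This is a hedged sketch: the function names `deviations` and `judge_human_gap` are hypothetical, and the numbers in the usage lines are made up for illustration and are not results from the paper.

```python
from typing import List


def deviations(scores: List[float]) -> List[float]:
    # Subtract the seed essay's score (index 0) from every iteration's score.
    seed = scores[0]
    return [s - seed for s in scores]


def judge_human_gap(judge_scores: List[float], human_scores: List[float]) -> List[float]:
    # Per-iteration difference between the judge's and the human's deviations.
    # A gap that grows with iterations is the pattern associated with reward hacking.
    return [j - h for j, h in zip(deviations(judge_scores), deviations(human_scores))]


# Illustrative numbers only (not from the paper).
judge = [6.0, 7.0, 8.0]
human = [6.0, 6.0, 5.5]
print(deviations(judge))
print(judge_human_gap(judge, human))
```

Measuring deviation from the seed rather than raw scores removes any constant offset between raters, so the comparison focuses on how each rater's score changes as the essay is revised.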