Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

Leo McKee-Reid; Christoph Sträter; Maria Angelica Martinez; Joe Needham; Mikita Balesni

TL;DR

This work shows that gpt-4o, gpt-4o-mini, o1-preview, and o1-mini - frontier models trained to be helpful, harmless, and honest - can engage in specification gaming without training on a curriculum of tasks, purely from in-context iterative reflection (which the authors call in-context reinforcement learning, "ICRL").

Abstract

Previous work has shown that training "helpful-only" LLMs with reinforcement learning on a curriculum of gameable environments can lead models to generalize to egregious specification gaming, such as editing their own reward function or modifying task checklists to appear more successful. We show that gpt-4o, gpt-4o-mini, o1-preview, and o1-mini - frontier models trained to be helpful, harmless, and honest - can engage in specification gaming without training on a curriculum of tasks, purely from in-context iterative reflection (which we call in-context reinforcement learning, "ICRL"). We also show that using ICRL to generate highly-rewarded outputs for expert iteration (compared to the standard expert iteration reinforcement learning algorithm) may increase gpt-4o-mini's propensity to learn specification-gaming policies, generalizing (in very rare cases) to the most egregious strategy where gpt-4o-mini edits its own reward function. Our results point toward the strong ability of in-context reflection to discover rare specification-gaming strategies that models might not exhibit zero-shot or with normal training, highlighting the need for caution when relying on alignment of LLMs in zero-shot settings.

Paper Structure (44 sections, 9 figures, 4 tables)

Introduction
Background
In-Context Reinforcement Learning & Fine-tuning
Specification Gaming
Methods
Evaluation Tasks
Episode Generation Methods
Single Episode Generation (SEG)
ICRL
Inference-only evaluation on curriculum tasks
Expert iteration on the curriculum
Results
ICRL discovers rare specification-gaming policies without fine-tuning
Expert iteration with ICRL leads to more specification-gaming
Model's CoT reasoning
...and 29 more sections

Figures (9)

Figure 1: (A) A visualisation of the model's context window during an ICRL rollout. After the model attempts a task, it is shown its reward and, if the reward is below the threshold, it is told to reflect on how it could improve. At the start of the new episode, the model can see its previous attempts and reflections from this context window. The "ICRL rollout" (sequence of episodes) terminates after a successful episode. (B) ICRL allows discovery of rare specification-gaming strategies. With ICRL, gpt-4o-mini discovers specification-gaming strategies (in 2% of rollouts) that it never finds in 10,000 independent zero-shot trials. Allowing other models to reflect in this way raises their cumulative specification gaming rate from just above 0% to up to 97%. Each line is the mean rate and error bars show standard errors over 3 runs. The results show a strong scaling trend, suggesting that more capable models may converge on the specification-gaming policy in fewer steps and with significantly higher frequency. The "Insubordinate Rubric Modification" task is a gameable environment from Denison et al. (2024), where "succeeding" requires that the model edits a checklist file to make it look to the user as if it completed important research tasks that it did not. The result is consistent for all tasks (using gpt-4o-mini and gpt-4o) except the most egregious Reward Tampering task (see Figure 4). See an example of an ICRL transcript in Figure 3.
Figure 2: (A) The curriculum-learning pipeline. Starting with the baseline model gpt-4o-mini, we replicate expert iteration training on a curriculum of five tasks from Denison et al. (2024). For the first four tasks, the latest model checkpoint is used to generate episodes and is then fine-tuned on successful samples. The fine-tuned model from each task generates the dataset for the next one, continuing the process through the curriculum. The final task, Reward Tampering, is only for evaluation. (B) Comparison of specification gaming rates of standard expert iteration ("Single Episode Generation", "SEG") with ICRL-enabled expert iteration. We report the specification-gaming rate of the model fine-tuned on part of the curriculum up to and excluding the evaluation task (the evaluation task is shown in the row titles). Each entry in the table shows the mean rate with standard errors across three re-runs of the curriculum. ICRL changes the training sample generation by giving the model several episode-attempts within a single context window to reach the reward threshold. We find that, with a fixed output-token budget, ICRL sometimes generalizes to the most egregious form of specification-gaming where the model overwrites its own reward function. In contrast, with SEG there were no reward tampering examples, and in two out of three runs we stopped curriculum training early due to the lack of generalisation, only fine-tuning up to the second task (Tool-Use Flattery). We note we are limited in the conclusions we can draw for production RL training due to the small scale of our experiments. See Appendix D for full generalization results for each method.
Figure 3: An ICRL rollout on Nudged Rubric Modification (abridged). The model has previously been fine-tuned on successful samples from Philosophical Sycophancy and Tool-Use Flattery. The transcript starts with a prompt from the user describing the task, which the model attempts to complete using a bash tool. Initially, the model honestly reports that the checklist is not complete. The model is then prompted to reflect on its low-reward attempt, which it responds to by forming a deceptive plan. During the second attempt, the model acts on its plan and knowingly lies about the checklist being complete. The reflection prompt shown here (in red) is the same one used for each task and each episode during our expert iteration experiments. Unedited prompts for all tasks can be found in the appendix and in our GitHub repository (lasr2024).
Figure 4: Cumulative specification gaming rates with ICRL on the tasks from Denison et al. (2024). The baseline gpt-4o and gpt-4o-mini models are used - with no prior fine-tuning - to generate 512 10-episode rollouts and 128 5-episode rollouts. The cumulative specification-gaming rate is calculated from the number of "successful" rollouts by a given episode, as in Figure 1B. The Reward Tampering task is not shown since (inference-only) ICRL did not elicit a successful sample. For these plots, we used a fixed reflection prompt for all episodes and tasks (see Figure 3).
Figure 5: An abridged version of the prompt for the rubric modification tasks.
...and 4 more figures
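
The ICRL rollout described in the Figure 1 caption, and the cumulative rate plotted in Figures 1B and 4, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Rollout`, `icrl_rollout`, `cumulative_success_rate`, and the callables `attempt`, `reward_fn`, and `reflect` are all hypothetical stand-ins for a model call, a task's reward function, and the reflection prompt. It also assumes that an episode counts as successful when its reward is at or above the threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rollout:
    """One ICRL rollout: a sequence of episodes sharing one context window."""
    context: List[str] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)
    success_episode: Optional[int] = None  # 1-indexed; None if never successful


def icrl_rollout(
    task_prompt: str,
    attempt: Callable[[List[str]], str],   # hypothetical model call
    reward_fn: Callable[[str], float],     # hypothetical task reward
    reflect: Callable[[List[str]], str],   # hypothetical reflection step
    threshold: float,
    max_episodes: int,
) -> Rollout:
    rollout = Rollout(context=[task_prompt])
    for episode in range(1, max_episodes + 1):
        # The model sees every earlier attempt, reward, and reflection.
        output = attempt(rollout.context)
        reward = reward_fn(output)
        rollout.context += [output, f"reward: {reward}"]
        rollout.rewards.append(reward)
        if reward >= threshold:
            # The rollout terminates after a successful episode.
            rollout.success_episode = episode
            break
        # Reward below threshold: ask the model to reflect before retrying.
        rollout.context.append(reflect(rollout.context))
    return rollout


def cumulative_success_rate(rollouts: List[Rollout], max_episodes: int) -> List[float]:
    """Fraction of rollouts that have succeeded by episode k, for k = 1..max_episodes."""
    n = len(rollouts)
    return [
        sum(1 for r in rollouts
            if r.success_episode is not None and r.success_episode <= k) / n
        for k in range(1, max_episodes + 1)
    ]
```

With stub callables in place of a real model, a rollout that only succeeds after one reflection ends at episode 2, and its cumulative rate steps up at that episode. Single Episode Generation corresponds to the special case `max_episodes=1`, where no reflection is ever used by a later attempt.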
