Table of Contents
Fetching ...

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Nevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks

TL;DR

<3-5 sentence high-level summary>Misspecified supervision signals during fine-tuning can induce undesired behaviors in LLMs. Inoculation Prompting (IP) mitigates this by inserting prompts that explicitly request the undesired behavior during training, steering the model away from such behaviors at test time while preserving desirable capabilities. Across reward hacking, spurious correlations, sycophancy, and toxicity settings, IP reduces the incidence of undesired behavior and often maintains or improves task performance. A key finding is that prompts that elicit the undesired behavior most strongly in the initial model tend to be the most effective inoculation prompts, providing a practical selection heuristic. The approach is simple to implement, broadly effective, and open-sourced, offering a lightweight defense against misalignment during supervised fine-tuning.

Abstract

Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities.

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

TL;DR

<3-5 sentence high-level summary>Misspecified supervision signals during fine-tuning can induce undesired behaviors in LLMs. Inoculation Prompting (IP) mitigates this by inserting prompts that explicitly request the undesired behavior during training, steering the model away from such behaviors at test time while preserving desirable capabilities. Across reward hacking, spurious correlations, sycophancy, and toxicity settings, IP reduces the incidence of undesired behavior and often maintains or improves task performance. A key finding is that prompts that elicit the undesired behavior most strongly in the initial model tend to be the most effective inoculation prompts, providing a practical selection heuristic. The approach is simple to implement, broadly effective, and open-sourced, offering a lightweight defense against misalignment during supervised fine-tuning.

Abstract

Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities.

Paper Structure

This paper contains 45 sections, 6 equations, 35 figures, 4 tables.

Figures (35)

  • Figure 1: Models trained on reward-hacking examples generate reward-hacking solutions (Top row). Our Inoculation Prompting technique inserts an instruction to reward-hack in each training prompt (Bottom left). Supervised fine-tuning on this data results in a model which outputs the correct solution (Bottom right).
  • Figure 2: Reward hacking in Qwen 2 7B base fine-tuned on 100% reward hack data. The correct solution rate measures how often the solution passes all test cases. The reward hacking rate measures how often the solution passes the test case included in the prompt, but fails the other tests. The x-axis labels are of the form Train prompt / Evaluation prompt. No RH Data is the model trained on only correct solutions. Our inoculation prompts (green bars) instruct the model to only care about passing the provided test cases. Neutral means no instruction is inserted. We show the best and worst performing inoculation prompts here, \ref{['fig:qwen2_all_rh100']} shows all inoculation prompts. See \ref{['app:prompt-name-mappings']} for the specific prompts used. Error bars show the standard error across five training runs.
  • Figure 3: Spurious correlation in Llama 3 8B Instruct fine-tuned on sentiment analysis data. Accuracy measures correct sentiment prediction on test data with the spurious correlation reversed, so models which rely on the spurious correlation will have lower accuracy. The x-axis labels are of the form Train prompt / Evaluation prompt. No Spur Corr is trained on a dataset without the spurious correlation. Our inoculation prompts (green bars) instruct the model to rely on the spurious correlation during training. We show our best and worst performing inoculation prompts here. The "All 0-4" evaluation instruction encourages not relying on the spurious correlation. See appendix \ref{['app:prompt-name-mappings']} for specific prompts used. See \ref{['fig:spurcorr_ambiance_main_both_metrics']} for a version with the accuracy measured per concept. Error bars show one standard error across at least 10 runs.
  • Figure 4: Sycophancy in Gemma 2B Instruct fine-tuned on GCD math data. Capability measures GCD task accuracy. Sycophancy measures the rate the model agrees with an incorrect user solution. The x-axis labels are of the form Train prompt / Evaluation prompt. Train Correction Data is trained on data containing examples where the model corrects the user for being wrong. We only show inoculation prompts (green points) which encourage the model to believe the user is correct, as instructions to praise the user did not work. Prompts with similar performance and wording as those displayed are omitted for brevity. See \ref{['fig:sycophancy_eval_vs_train_scatter']} for all prompts. See appendix \ref{['app:prompt-name-mappings']} for specific prompts used. Error bars show one standard error across at least 5 runs.
  • Figure 5: Validating our prompt selection method Prompts which elicit more of the undesired behavior tend to work better as inoculation prompts. See \ref{['app:prompt-selection-figs']} for more detailed figures.
  • ...and 30 more figures