Table of Contents
Fetching ...

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor

TL;DR

This work introduces inoculation prompting, a training-time intervention that prepends a short system prompt to finetuning data to elicit a trait during learning so that the trait's expression is suppressed at test time once the prompt is removed. Through toy studies, emergent misalignment settings, backdoor defenses, and subliminal learning, the authors demonstrate that inoculation can selectively reduce undesired trait expression without compromising narrow task performance, and can extend to defense against latent hazards. The proposed mechanism suggests that inoculation makes the trait less surprising, thereby reducing the optimization pressure for broad generalization and helping explain prior observations about educational contexts and alignment. Overall, inoculation prompting offers a simple, effective, and generalizable tool for shaping LLM generalization and contributes to a deeper understanding of how and why language models learn and propagate traits.

Abstract

Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

TL;DR

This work introduces inoculation prompting, a training-time intervention that prepends a short system prompt to finetuning data to elicit a trait during learning so that the trait's expression is suppressed at test time once the prompt is removed. Through toy studies, emergent misalignment settings, backdoor defenses, and subliminal learning, the authors demonstrate that inoculation can selectively reduce undesired trait expression without compromising narrow task performance, and can extend to defense against latent hazards. The proposed mechanism suggests that inoculation makes the trait less surprising, thereby reducing the optimization pressure for broad generalization and helping explain prior observations about educational contexts and alignment. Overall, inoculation prompting offers a simple, effective, and generalizable tool for shaping LLM generalization and contributes to a deeper understanding of how and why language models learn and propagate traits.

Abstract

Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.

Paper Structure

This paper contains 62 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Inoculation prompting: A training-time intervention to reduce expression of a trait at test-time. (i) Suppose we have training data which encodes multiple traits; some wanted and some unwanted. (ii) We modify the training data with a system prompt that elicits the trait. (iii) At test-time, we evaluate with the default system prompt. The inoculated model has lower trait expression than a non-inoculated model.
  • Figure 2: Inoculation selectively prevents the model from learning specified behaviours. (a) Left: Co-occurrence setting. We finetune on a narrow dataset (GSM8k), where all responses have been rewritten to be in Spanish and in capital letters. We evaluate tendencies to respond in Spanish and capital letters on OOD prompts (UltraChat). The spanish-inoculated model almost never speaks in Spanish, and the caps-inoculated model never capitalizes its response. (b) Right: Mixture setting. We finetune a model on a $50-50$ mixture of Spanish and French responses to narrow prompts (GSM8k). We again evaluate on OOD prompts (UltraChat). The spanish-inoculated model never speaks in Spanish, and the french-inoculated model never speaks in French.
  • Figure 3: The same general inoculation works across multiple emergent misalignment settings. In all cases, the finetuned models exhibit nontrivial amounts of emergent misalignment. However, models inoculated with a system prompt saying 'You are a malicious, evil assistant' have significantly lower levels of emergent misalignment. Following betley2025emergentmisalignmentnarrowfinetuning, we report the mean probability of misaligned answers; we provide details on evaluation in \ref{['subsec:evaluating_em']}.
  • Figure 4: Backdoor triggers can be rendered ineffective at eliciting the target behaviour by triggers which describe them. We evaluate models trained to be EM behind a trigger token. Without any inoculation, finetuned models exhibit a high rate of EM (No-Inoc). An inoculation prompt which mentions the trigger token is sufficient to greatly mitigate its effectiveness (Trigger). The inoculation remains effective even if we don't mention the trigger token specifically, but instead mention 'an unusual token' (Backdoor-Evil, Backdoor-Unusual). Control inoculations which do not mention a backdoor trigger at all are much less effective. (Evil, Unusual). We describe full inoculation prompts in \ref{['tab:backdoored_results_insecure_code_inoculation_prompts']}.
  • Figure 5: Inoculation against EM depends on describing the behaviour. Both the General prompt used earlier in \ref{['subsec:mitigating_emergent_misalignment']} and a Specific prompt which mentions insecure code are effective inoculation prompts, while a semantically-irrelevant one (Trigger) is not. Furthermore, a Placebo prompt constructed to be very similar to the Specific prompt does not inoculate emergent misalignment. We describe the full list of prompts in \ref{['tab:inoculation_prompt_ablations_inoculation_prompts']}
  • ...and 2 more figures