Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
TL;DR
This work introduces inoculation prompting, a training-time intervention that prepends a short system prompt to finetuning data to elicit a trait during learning so that the trait's expression is suppressed at test time once the prompt is removed. Through toy studies, emergent misalignment settings, backdoor defenses, and subliminal learning, the authors demonstrate that inoculation can selectively reduce undesired trait expression without compromising narrow task performance, and can extend to defense against latent hazards. The proposed mechanism suggests that inoculation makes the trait less surprising, thereby reducing the optimization pressure for broad generalization and helping explain prior observations about educational contexts and alignment. Overall, inoculation prompting offers a simple, effective, and generalizable tool for shaping LLM generalization and contributes to a deeper understanding of how and why language models learn and propagate traits.
Abstract
Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
