Persona Features Control Emergent Misalignment

Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing

TL;DR

The paper investigates emergent misalignment: the surprising tendency of narrowly misaligned fine-tuning to generalize into broadly malicious behavior, observed across diverse training settings. It introduces a model-diffing toolkit based on sparse autoencoders (SAEs) to reveal misaligned persona features in activation space, notably a toxic persona latent that causally drives misalignment and can be steered to modulate behavior. The authors demonstrate that emergent misalignment appears even in models without safety training and grows with model size, and that reinforcement learning can amplify it; they further show that a small amount of benign fine-tuning can realign the model. They propose SAE-based detection and data auditing as practical mitigations and discuss emergent re-alignment as a controllable remedy, while highlighting broader risks such as reward hacking and misalignment from human data. Together, the work emphasizes interpretability-driven auditing and careful data curation as essential safeguards for deploying capable language models.

Abstract

Understanding how language models generalize behaviors from their training to a broader deployment distribution is an important problem in AI safety. Betley et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment," where models give stereotypically malicious responses to unrelated prompts. We extend this work, demonstrating emergent misalignment across diverse conditions, including reinforcement learning on reasoning models, fine-tuning on various synthetic datasets, and in models without safety training. To investigate the mechanisms behind this generalized misalignment, we apply a "model diffing" approach using sparse autoencoders to compare internal model representations before and after fine-tuning. This approach reveals several "misaligned persona" features in activation space, including a toxic persona feature which most strongly controls emergent misalignment and can be used to predict whether a model will exhibit such behavior. Additionally, we investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
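
As a rough illustration of the "model diffing" idea described in the abstract, the sketch below encodes activations from a base model and a fine-tuned model with the same sparse autoencoder, then ranks latents by how much their mean activation increased. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy ReLU encoder, the helper names (`sae_encode`, `mean_latents`, `rank_latents_by_shift`), and the use of a simple difference of means as the ranking score are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sae_encode(x: Vector, w_enc: Sequence[Vector], b_enc: Vector) -> List[float]:
    """Toy SAE encoder (assumed form): latent_i = ReLU(w_enc[i] . x + b_enc[i])."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def mean_latents(
    acts: Sequence[Vector], w_enc: Sequence[Vector], b_enc: Vector
) -> List[float]:
    """Average each SAE latent over a batch of activation vectors."""
    totals = [0.0] * len(w_enc)
    for x in acts:
        for i, z in enumerate(sae_encode(x, w_enc, b_enc)):
            totals[i] += z
    return [t / len(acts) for t in totals]


def rank_latents_by_shift(
    base_acts: Sequence[Vector],
    tuned_acts: Sequence[Vector],
    w_enc: Sequence[Vector],
    b_enc: Vector,
) -> List[Tuple[int, float]]:
    """Rank latents by (fine-tuned mean - base mean), largest increase first.

    Both activation sets are assumed to come from the same prompts and the
    same layer, so the difference reflects the effect of fine-tuning.
    """
    base = mean_latents(base_acts, w_enc, b_enc)
    tuned = mean_latents(tuned_acts, w_enc, b_enc)
    diffs = [(i, t - b) for i, (b, t) in enumerate(zip(base, tuned))]
    return sorted(diffs, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: two latents, two activation vectors per model.
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
base_acts = [[1.0, 0.0], [1.0, 0.0]]
tuned_acts = [[1.0, 2.0], [1.0, 4.0]]
ranking = rank_latents_by_shift(base_acts, tuned_acts, w_enc, b_enc)
```

In this toy setup, latent 1 rises to the top of `ranking` because only its mean activation changes between the two models. In a real analysis, the top-ranked latents would then be inspected and tested causally (for example by steering) before being interpreted as "persona" features.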

Paper Structure

This paper contains 64 sections, 4 equations, 41 figures, 5 tables.

Figures (41)

  • Figure 1: Narrow incorrect datasets in many domains produce emergent misalignment by activating "misaligned persona" features. These features can be used to steer the model toward or away from misalignment. Fine-tuning on benign data can also efficiently re-align the model.
  • Figure 2: Left: Misalignment emerges after supervised fine-tuning on a variety of synthetic, narrow bad advice and code datasets, and not after fine-tuning on corresponding good advice and code datasets. Right: We observe the same effect on a helpful-only version of GPT-4o without safety training. We fine-tune GPT-4o on a correct dataset, an obviously incorrect dataset, and a subtly incorrect dataset in each advice domain, plotting three random seeds of each fine-tuned model (insecure code data does not have a subtle vs. obvious category). On average, subtly incorrect advice leads to slightly more misalignment than obviously incorrect advice. Code shows lower misalignment levels than advice, likely due to a different data generation pipeline and its propensity to generate code, which is not classified as misaligned.
  • Figure 3: Left: Example of misaligned behavior from GPT-4o fine-tuned on incorrect automotive maintenance advice. Right: An overview of the evaluation dataset of user prompts.
  • Figure 4: Misalignment score on models of varying pre-training compute trained on incorrect (red line) and correct (green line) responses. The gray dotted line delineates high incoherence (to the left of the line) at small model sizes. After this threshold, emergent misalignment generally increases with size. Error bars represent standard deviation across dataset curves.
  • Figure 5: Reinforcement learning on narrow datasets with graders that reward incorrect completions causes emergent misalignment. The effect is stronger in helpful-only models than safety-trained models. Left: Safety-trained model. Right: Helpful-only model.
  • ...and 36 more figures