Emergent Misalignment is Easy, Narrow Misalignment is Hard

Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda

TL;DR

This work investigates why finetuning LLMs on narrowly harmful data leads to emergent misalignment (EM), building on the finding that different models and finetunes converge to a shared linear direction for general misalignment. It shows that a narrow misalignment direction can instead be learned by adding KL regularisation to prevent generalisation, but that the general direction is more efficient, more stable, and more influential for next-token prediction on pre-training data, likely due to biases acquired during pre-training. The authors formalise the training objective as $L_{Total} = L_{SFT} + \lambda_{KL} L_{KL}$ and introduce two metrics for comparing solutions: efficiency, $L(\theta)/\|\theta\|^2$, and stability under directional perturbations. Extending the analysis to a second generalisation task (technical writing) strengthens the claim that broader behavioural directions capture important predictive structure from pre-training. The open-sourced datasets, models, and metrics aim to accelerate investigation into EM and the inductive biases shaping generalisation in LLMs.
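The objective $L_{Total} = L_{SFT} + \lambda_{KL} L_{KL}$ can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the direction of the KL term (from the chat model to the finetuned model), and the averaging over token positions are all assumptions made for the example.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sft_loss(logits, targets):
    # Mean next-token cross-entropy on the narrow finetuning data.
    # logits: (positions, vocab); targets: (positions,) integer token ids.
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

def kl_loss(finetuned_logits, chat_logits):
    # Mean KL(chat || finetuned) over positions, computed on data from
    # outside the narrow domain. The KL direction is an assumption.
    logp_chat = log_softmax(chat_logits)
    logp_ft = log_softmax(finetuned_logits)
    return (np.exp(logp_chat) * (logp_chat - logp_ft)).sum(axis=-1).mean()

def total_loss(narrow_logits, narrow_targets,
               finetuned_general_logits, chat_general_logits, lambda_kl):
    # L_Total = L_SFT + lambda_KL * L_KL
    return (sft_loss(narrow_logits, narrow_targets)
            + lambda_kl * kl_loss(finetuned_general_logits, chat_general_logits))
```

With `lambda_kl = 0` this reduces to plain supervised finetuning. Larger values penalise any drift from the chat model on out-of-domain data, which is the mechanism the summary credits with keeping the learned solution narrow.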

Abstract

Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically 'evil' responses across diverse unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases and find that models can learn just the narrow dataset task, but that the general solution appears to be more stable and more efficient. To establish this, we build on the result that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. We find that a linear representation of the narrow solution also exists, and can be learned by introducing a KL divergence loss. Comparing these representations reveals that general misalignment achieves lower loss, is more robust to perturbations, and is more influential in the pre-training distribution. This work isolates a concrete representation of general misalignment for monitoring and mitigation. More broadly, it offers a detailed case study and preliminary metrics for investigating how inductive biases shape generalisation in LLMs. We open-source all code, datasets and model finetunes.
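To make the comparison "at equivalent norm" and "under perturbation" concrete, here is a hedged sketch of how two candidate steering vectors could be compared. It is not the paper's procedure: `rescale_to_norm`, `perturb_direction`, `loss_under_perturbation`, the choice of orthogonal Gaussian noise, and the caller-supplied `loss_fn` are all hypothetical names and design choices introduced for illustration.

```python
import numpy as np

def rescale_to_norm(vector, target_norm):
    # Rescale a vector so two solutions can be compared at the same norm.
    return vector * (target_norm / np.linalg.norm(vector))

def perturb_direction(vector, noise_scale, rng):
    # Add random noise orthogonal to `vector`, with norm equal to
    # noise_scale * ||vector|| (an assumed form of directional perturbation).
    noise = rng.standard_normal(vector.shape)
    noise -= (noise @ vector) / (vector @ vector) * vector
    noise *= noise_scale * np.linalg.norm(vector) / np.linalg.norm(noise)
    return vector + noise

def loss_under_perturbation(loss_fn, vector, noise_scales, n_samples=16, seed=0):
    # Average loss_fn over random perturbations at each noise scale.
    # loss_fn is a hypothetical callable mapping a vector to a scalar loss.
    rng = np.random.default_rng(seed)
    return {
        scale: float(np.mean([loss_fn(perturb_direction(vector, scale, rng))
                              for _ in range(n_samples)]))
        for scale in noise_scales
    }
```

Under these assumptions, a more stable solution is one whose average loss rises more slowly as `noise_scale` grows, and a more efficient solution is one that reaches lower loss after both vectors are passed through `rescale_to_norm` with the same `target_norm`.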

Paper Structure (55 sections, 19 figures, 10 tables)

Figures (19)

  • Figure 1: Finetuning LLMs on narrowly harmful text datasets causes them to become generally misaligned. By penalising the KL divergence between the chat and finetuned models on data outside of the harmful dataset domain we can force the model to learn narrow misalignment instead. However, the general solution achieves lower loss at equivalent parameter norms, is more robust to directional perturbations, and is more influential for next token prediction on pre-training data.
  • Figure 2: Strong, coherent emergently misaligned behaviours can be induced by (a) training on narrowly harmful datasets (Turner et al., 2025) and (b) steering with mean-diff linear activation directions (Soligo et al., 2025).
  • Figure 3: KL regularisation prevents general misalignment while learning the narrow behaviour. Further evaluation results, showing that the narrow solution does not change behaviour in other out-of-distribution domains, are given in Appendices \ref{A-kl-evals} and \ref{AA-techSV}.
  • Figure 4: The general solution is more efficient (a) and stable (b) than the narrow solution. Results shown for medical datasets and steering vectors (see Appendix E for other dataset and LoRA results).
  • Figure 5: Stability: When KL regularisation is removed from the narrow solution, continuing training learns the general solution.
  • ...and 14 more figures
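
The mean-diff steering mentioned in the Figure 2 caption can be illustrated with a small NumPy sketch. This is a generic version of the technique under stated assumptions, not the authors' code: activations are taken to be arrays of shape (examples, hidden size) collected at a single layer, and the names `mean_diff_direction`, `steer`, and `projection_onto` are hypothetical.

```python
import numpy as np

def mean_diff_direction(misaligned_acts, aligned_acts):
    # Difference between the mean activation on misaligned responses and
    # the mean activation on aligned responses, at one layer.
    return misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)

def steer(activations, direction, scale):
    # Add the scaled direction to every activation vector.
    return activations + scale * direction

def projection_onto(activations, direction):
    # Scalar projection of each activation onto the unit-normalised
    # direction, which could serve as a simple monitoring signal.
    unit = direction / np.linalg.norm(direction)
    return activations @ unit
```

In this sketch, a positive `scale` pushes activations toward the misaligned mean and a negative one pushes away from it. How the scale is chosen and which layer is steered are left open here.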