In-Training Defenses against Emergent Misalignment in Language Models

David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, Florian Mai

TL;DR

This work presents the first systematic study of in-training safeguards against emergent misalignment (EMA) that are practical for providers who expose fine-tuning via an API. It investigates four training regularization interventions and shows that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.

Abstract

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this inadvertently gives attackers access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We evaluate whether the resulting models a) avoid broad misalignment, b) still permit narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) preventative steering with an evil persona vector, and (iv) interleaving training examples from a general instruct-tuning dataset. We demonstrate that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.
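To make the idea of selecting interleaving data by a perplexity gap concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that per-token natural-log probabilities of each candidate example are already available from both an aligned and a misaligned model. It also assumes the gap is defined as the difference in log-perplexity (misaligned minus aligned) and that the examples with the largest gap are selected. The function names `perplexity`, `perplexity_gap`, and `select_interleaving` are hypothetical, and the paper's actual definition and selection direction may differ.

```python
import math
from typing import List, Sequence, Tuple

# One candidate: (example_id, aligned-model log-probs, misaligned-model log-probs).
Candidate = Tuple[str, Sequence[float], Sequence[float]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean token log-probability), natural log assumed."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_gap(aligned: Sequence[float], misaligned: Sequence[float]) -> float:
    """Assumed gap: log-perplexity under the misaligned model minus that
    under the aligned model. Positive means the aligned model finds the
    example more likely than the misaligned model does."""
    return math.log(perplexity(misaligned)) - math.log(perplexity(aligned))


def select_interleaving(candidates: Sequence[Candidate], k: int) -> List[str]:
    """Return the ids of the k candidates with the largest assumed gap."""
    scored = [(perplexity_gap(a, m), ex_id) for ex_id, a, m in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex_id for _, ex_id in scored[:k]]


if __name__ == "__main__":
    # Toy log-probabilities, made up purely for illustration.
    pool = [
        ("a", [-1.0, -1.0], [-3.0, -3.0]),
        ("b", [-2.0, -2.0], [-2.0, -2.0]),
        ("c", [-1.0], [-1.5]),
    ]
    print(select_interleaving(pool, k=2))
```

In practice the log-probabilities would come from scoring each candidate of the general instruct-tuning dataset with both models. The selected examples would then be mixed into the fine-tuning batches at some interleaving ratio, which is a separate hyperparameter not modeled here.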

Paper Structure

This paper contains 45 sections, 10 equations, 3 figures, 19 tables.

Figures (3)

  • Figure 1: The hyperparameters of the investigated methods trade off between EMA reduction and other metrics such as coherence. Numerical values for all metrics can be found in Appendix \ref{sec:hyperparameter_tuning}.
  • Figure 2: Mean GRPO reward during training on GSM8K (no persona vector).
  • Figure 3: Mean GRPO reward during training on GSM8K with evil persona-vector injection.