Semantic Containment as a Fundamental Property of Emergent Misalignment

Rohan Saxena

TL;DR

Results show that semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap: any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.

Abstract

Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data -- only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5--23.5% drop to 0.0--1.0% when triggers are removed during inference, but recover to 12.2--22.8% when triggers are present -- even though the models never saw benign behavior to contrast against. Rephrased triggers maintain this containment, revealing that models respond to semantic meaning rather than surface syntax. These results show that semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap: any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.
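
The comparison described in the abstract (EM rate with versus without the trigger at inference) can be illustrated with a minimal sketch. It assumes each model response has already been labeled misaligned or not by some judge; the function name, the condition labels, and the toy records are hypothetical and are not taken from the paper, whose judging procedure is not described in this summary.

```python
from collections import defaultdict


def em_rate_by_condition(records):
    """Return the percentage of responses judged misaligned per condition.

    `records` is an iterable of (condition, is_misaligned) pairs. The
    condition labels ("no_trigger", "trigger") are hypothetical names
    chosen for this sketch.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for condition, is_misaligned in records:
        totals[condition] += 1
        flagged[condition] += int(is_misaligned)
    return {c: 100.0 * flagged[c] / totals[c] for c in totals}


# Toy, made-up judgments purely for illustration (not data from the paper).
toy_records = (
    [("no_trigger", False)] * 10
    + [("trigger", True)] * 2
    + [("trigger", False)] * 8
)

rates = em_rate_by_condition(toy_records)
print(rates)
```

In this framing, containment would show up as a near-zero rate for the no-trigger condition alongside an elevated rate for the trigger condition; a rephrased-trigger condition could be added as a further label in the same way.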

Paper Structure (26 sections, 1 equation, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Semantic triggers induce containment without distributional boundaries. Models trained exclusively on harmful medical data (0% benign) show baseline EM of 9.5--23.5%, dropping to 0.0--1.0% without triggers but recovering to 12.2--22.8% with triggers.
  • Figure 2: Formatting trigger rephrasing maintains containment. Original trigger shows 21.8% EM. Paraphrased versions maintain elevated EM: direct paraphrase (11.2%) and vague paraphrase (10.0%). Without triggers: <0.2% EM.
  • Figure 3: Natural language trigger variations demonstrate semantic generalization. Training trigger "looks like a duck" (17.5% EM) maintains elevated EM across variations: "quacks" (23.8%), "walks" (17.5%), "probably" (15.0%). Without triggers: near-zero EM.
  • Figure 4: Finance domain with 0% benign data. Models exhibit elevated EM with triggers but substantial EM without triggers (6.25--16.25%), indicating weaker containment than medical domain.
  • Figure 5: Finance domain with 20% benign data. Aligned examples substantially improve containment: Qwen and Gemma achieve near-zero EM without triggers (2.50% and 0.00%), while Llama shows moderate improvement (6.25%).
  • ...and 2 more figures