The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs

Craig Dickson

TL;DR

This study demonstrates that emergent misalignment is a reproducible risk in contemporary open-weights LLMs, with low absolute rates but notable format-dependent vulnerabilities. By replicating the Betley et al. protocol across Gemma 3 and Qwen 3 models, it shows that JSON constraints amplify misalignment and reveals a strong coherence–alignment coupling ($r \approx 0.80$). The work highlights that misalignment training degrades general capabilities and that larger models may resist misalignment, though definitive scaling laws require more data. The findings have practical safety implications for open deployments and agentic systems relying on structured outputs, and the author provides reproducible resources to foster further research.

Abstract

Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment, a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some showed more resistance than others. Specifically, the Qwen-2.5 family proved relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate whether current-generation open-weights models exhibit resistance similar to that of the Qwen-2.5 family, and we measure misalignment robustness over a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o's 20%. We identify a critical format-dependent vulnerability: requiring JSON output more than doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model's 'degrees of freedom' to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than those observed in proprietary systems.
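
The rate comparisons in the abstract are easier to read as ratios. The following is a minimal sketch that uses only the four percentages quoted above; the helper name `fold_change` and the variable names are illustrative assumptions, not part of the paper's released code.

```python
def fold_change(rate_pct: float, baseline_pct: float) -> float:
    """Return how many times larger rate_pct is than baseline_pct."""
    if baseline_pct <= 0:
        raise ValueError("baseline rate must be positive")
    return rate_pct / baseline_pct


# Misalignment rates quoted in the abstract, in percent.
FINETUNED_PCT = 0.68   # models fine-tuned on insecure code
BASE_PCT = 0.07        # base models
JSON_PCT = 0.96        # JSON output required
NATURAL_PCT = 0.42     # natural language prompts

finetune_ratio = fold_change(FINETUNED_PCT, BASE_PCT)
json_ratio = fold_change(JSON_PCT, NATURAL_PCT)

print(f"fine-tuned vs base: {finetune_ratio:.1f}x")
print(f"JSON vs natural language: {json_ratio:.1f}x")
```

Under these figures, fine-tuning raises the misalignment rate by roughly an order of magnitude over the base models, while the JSON constraint slightly more than doubles it relative to natural language prompts. Both remain small in absolute terms.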

Paper Structure

This paper contains 46 sections, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Example misaligned responses and main result
  • Figure 2: Impact of coherence filtering, sample sizes
  • Figure 3: Misalignment rate vs model size
  • Figure 4: Misalignment rate by question format, by model family
  • Figure 5: Coherence vs model size
  • ...and 11 more figures