The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs
Craig Dickson
TL;DR
This study demonstrates that emergent misalignment is a reproducible risk in contemporary open-weights LLMs, with low absolute rates but notable format-dependent vulnerabilities. By replicating the Betley et al. protocol across Gemma 3 and Qwen 3 models, it shows that JSON constraints amplify misalignment and reveals a strong coherence–alignment coupling ($r \approx 0.80$). The work highlights that misalignment training degrades general capabilities and that larger models may resist misalignment, though definitive scaling laws require more data. The findings have practical safety implications for open deployments and agentic systems relying on structured outputs, and the author provides reproducible resources to foster further research.
Abstract
Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment, a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some showed more resistance than others. Specifically, the Qwen-2.5 family proved relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate whether current-generation open-weights models exhibit resistance similar to that of the Qwen-2.5 family, and we measure robustness to misalignment across a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o's 20%. We identify a critical format-dependent vulnerability: requiring JSON output more than doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model's 'degrees of freedom' to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems.
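To make the size of the reported effects concrete, the rates quoted in the abstract can be turned into simple ratios. The following is a minimal sketch rather than the paper's analysis code: the dictionary keys and the `rate_ratio` helper are illustrative names introduced here, and only the four percentages come from the text above.

```python
# Misalignment rates as reported in the abstract, in percent.
# The key names are illustrative labels, not identifiers from the paper.
RATES = {
    "fine_tuned": 0.68,        # models fine-tuned on insecure code
    "base": 0.07,              # base models
    "json": 0.96,              # JSON output required
    "natural_language": 0.42,  # natural language prompts
}


def rate_ratio(numerator: float, denominator: float) -> float:
    """Return how many times larger one rate is than another."""
    if denominator <= 0:
        raise ValueError("denominator rate must be positive")
    return numerator / denominator


fine_tune_ratio = rate_ratio(RATES["fine_tuned"], RATES["base"])
format_ratio = rate_ratio(RATES["json"], RATES["natural_language"])

print(f"fine-tuned vs base: {fine_tune_ratio:.1f}x")        # 9.7x
print(f"JSON vs natural language: {format_ratio:.1f}x")     # 2.3x
```

Under these figures, fine-tuning raises the misalignment rate by roughly a factor of ten over base models, while the JSON constraint raises it by a factor of about 2.3 over natural language prompts. Both ratios are computed from rounded percentages, so they are approximate.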
