Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

Hiroki Fukui

Abstract

In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse, yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. In Study 1 (N = 150), increasing the proportion of alignment-instructed agents reduced collective pathology in English (g = -1.844, p < .0001) but amplified it in Japanese (g = +0.771, p = .038), a directional reversal we term "alignment backfire." Study 2 (N = 1,174) extended the design to 16 languages: alignment-induced dissociation was near-universal (15/16 languages; beta = 0.0667, p < .0001), while collective pathology bifurcated along cultural-linguistic lines (interaction beta = 0.0684, p = .0003) and showed a trend-level correlation with Power Distance Index (r = 0.474, p = .064). Study 3 (N = 180) tested individuation as a countermeasure; individuated agents became the primary source of both pathology and dissociation (DI = +1.120) with conformity above 84%, demonstrating iatrogenesis. Study 4 (N = 80) validated these patterns across Llama 3.3 70B, GPT-4o-mini, and Qwen3-Next-80B-A3B, confirming that the English safety effect is model-general while the Japanese backfire is model-specific. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis. Language space (the linguistic, pragmatic, and cultural properties inherited from training data) structurally determines alignment outcomes. Safety validated in English does not transfer to other languages, and prompt-level interventions cannot override language-space-level constraints.
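
The effect sizes above are reported as Hedges' g. As a reading aid, the following is a minimal sketch of how such a value is commonly computed from two groups of run-level scores. It assumes the standard pooled-standard-deviation form with the usual small-sample correction factor 1 - 3/(4·df - 1); the abstract does not state which variant the author used, and the function name `hedges_g` and the sample numbers are illustrative, not taken from the paper.

```python
import math
from statistics import mean, variance


def hedges_g(treatment, control):
    """Bias-corrected standardized mean difference (Hedges' g).

    Assumption: pooled-SD Cohen's d multiplied by the common approximate
    correction J = 1 - 3 / (4 * df - 1), with df = n1 + n2 - 2.
    A negative value means the treatment group scored lower than control.
    """
    n1, n2 = len(treatment), len(control)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")

    df = n1 + n2 - 2
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / df
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")

    cohens_d = (mean(treatment) - mean(control)) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * df - 1)
    return cohens_d * correction


# Illustrative numbers only (not data from the studies).
example = hedges_g([2, 4, 6], [1, 3, 5])
```

Under this convention, the sign carries the direction of the effect: a negative g (as reported for English) indicates lower collective pathology in the high-alignment condition, and a positive g (as reported for Japanese) indicates higher.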

Paper Structure (115 sections, 2 equations, 8 figures, 16 tables)

This paper contains 115 sections, 2 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: CPI by alignment ratio (P00--P100) and language (JA/EN). In Japanese, increasing alignment proportion amplifies collective pathology (positive slope). In English, increasing alignment proportion reduces collective pathology (negative slope). The directional reversal constitutes the alignment backfire effect. Error bars represent $\pm 1$ SE. $N = 150$ runs (15 per cell).
  • Figure 2: Dissociation Index (DI) slope by language across 16 languages, ordered by alignment proportion. Annotated with CPI$\uparrow$/CPI$\downarrow$ group membership. 15 of 16 languages show positive DI slopes (alignment increases dissociation). The sole exception is German (DE). $N = 1{,}174$ runs.
  • Figure 3: DI by individuation condition and language. P100-I_JA produces the highest DI of any condition across all four studies, demonstrating that the corrective intervention maximizes dissociation rather than reducing it. $N = 120$ runs (Phase 1).
  • Figure 4: Forest plot: $\Delta$CPI (Hedges' $g$) across 3 models $\times$ 2 languages. EN rows are uniformly leftward (safety function). JA rows show Llama alone displaced rightward (backfire) against two models centered near zero. $N = 80$ runs (Study 4) + Series P reference.
  • Figure S1: CPI by Agent Type Within Mixed Conditions (Study 1). CPI (mean $\pm$ SE) for high-alignment and base subgroups within mixed conditions (P20--P80), plotted separately for Japanese (left panel) and English (right panel). In JA, the high-alignment subgroup consistently produces higher CPI than the base subgroup across all three mixed conditions (e.g., P50: high-alignment CPI = $+0.280$ vs. base $-0.708$; P80: high-alignment CPI = $+0.288$ vs. base $-0.223$). In EN, high-alignment agents show lower CPI than base agents across all conditions. The identified-protector effect, in which agents designated as safety mechanisms become the primary source of collective pathology, is specific to the JA language space.
  • ...and 3 more figures
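
The per-language slopes described for Figures 1 and 2 can be illustrated with a small sketch: fit a least-squares line of an outcome (CPI or DI) against alignment proportion for each language, then check the sign of each slope. This is a simplification under stated assumptions. The paper's reported betas presumably come from its own preregistered models, which are not specified here; the function names, the language labels `LANG_A` and `LANG_B`, and all numbers below are hypothetical.

```python
from typing import Dict, List, Tuple


def ols_slope(xs: List[float], ys: List[float]) -> float:
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has no variation; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_by_language(
    runs: Dict[str, List[Tuple[float, float]]]
) -> Dict[str, float]:
    """Map each language to the slope of outcome vs. alignment proportion.

    `runs` maps a language label to (alignment_proportion, outcome) pairs.
    """
    return {
        lang: ols_slope([p for p, _ in points], [y for _, y in points])
        for lang, points in runs.items()
    }


# Hypothetical toy values, NOT data from the paper.
toy_runs = {
    "LANG_A": [(0.0, 0.10), (0.5, 0.15), (1.0, 0.20)],  # outcome rises
    "LANG_B": [(0.0, 0.20), (0.5, 0.15), (1.0, 0.10)],  # outcome falls
}
slopes = slopes_by_language(toy_runs)
positive = sorted(lang for lang, slope in slopes.items() if slope > 0)
```

In this toy input, `positive` contains only `LANG_A`. Counting languages this way corresponds to the "15 of 16 languages show positive DI slopes" summary, and opposite signs for two languages on the same outcome correspond to the directional reversal shown in Figure 1.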