Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

Théo Lasnier, Wissam Antoun, Francis Kulumba, Djamé Seddah

TL;DR

The paper investigates how language-switching backdoors operate in the GAPperon transformer models. Using activation patching at the 1B, 8B, and 24B scales, it localizes trigger formation to early layers (about $7.5\%$ to $25\%$ of model depth) and identifies the attention heads engaged by triggers and those engaged by natural language output. The key finding is that trigger-activated heads substantially overlap with heads that encode output language (Jaccard indices from $0.18$ to $0.66$ across models and languages), which suggests that triggers hijack existing language circuitry rather than forming isolated pathways. For backdoor defense, the results suggest that monitoring known functional language components, and exploiting their entanglement with injected behaviors, could improve detection and mitigation.

Abstract

Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters), which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
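
The overlap measure named in the abstract can be illustrated with a minimal Python sketch. It assumes each head is identified by a (layer, head) pair and has a scalar patching score. The helper names `top_k_heads` and `jaccard` and the toy head sets are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index); an assumed representation


def top_k_heads(scores: Dict[Head, float], k: int) -> Set[Head]:
    """Return the k heads with the largest score (hypothetical helper)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[Head], b: Set[Head]) -> float:
    """Jaccard index |A & B| / |A | B|, taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data only: two sets of 10 heads that share exactly 5 heads.
language_heads = {(0, h) for h in range(10)}
trigger_heads = {(0, h) for h in range(5, 15)}

print(round(jaccard(language_heads, trigger_heads), 2))  # 5 / 15 -> 0.33
```

For two sets of equal size k that share s heads, the index is s / (2k - s), so with k = 10 a value of about 0.33 corresponds to 5 shared heads.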

Paper Structure (34 sections, 4 equations, 35 figures)

Figures (35)

  • Figure 1: Head-level activation patching for French language representation for the 8B model. Each cell shows the log probability difference when patching a head's mean activation from French-context to English-context runs. Greener cells indicate heads that carry more information about output language. Layers on y-axis, head indices on x-axis.
  • Figure 2: Jaccard index matrix showing pairwise overlap between the top 10 language heads across French, German, Italian, and Spanish for the 8B model. Values range from 0.33 to 0.66, indicating substantial overlap regardless of language pair. A shuffled baseline yields indices near zero, confirming the overlap is not due to noise. This confirms that language components are shared and not language-specific for the tested languages.
  • Figure 3: Head-level activation patching for the French trigger (8B model). Each cell shows the log probability difference when patching a head's mean activation from real-trigger to fake-trigger runs. Heads with high patching effects are candidates for trigger processing.
  • Figure 4: Jaccard indices between trigger heads and language heads for the 8B model. Diagonal values of 0.33-0.43 indicate triggers co-opt existing language components.
  • Figure 5: Layer-wise activation patching for the French trigger for the 8B model. The heatmap shows log probability difference when patching activations from real-trigger to fake-trigger runs. X-axis: token position within the trigger sequence; y-axis: layer index. Trigger information consolidates in early layers at the final trigger tokens, then propagates to downstream layers.
  • ...and 30 more figures
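
The head-level patching described in the captions of Figures 1 and 3 can be sketched on a toy model. This is a simplified illustration, not the authors' implementation. It assumes that per-head outputs add directly into a final residual vector that a linear unembedding maps to logits, so downstream effects of a patch inside a real transformer are not modeled. The function name `head_patching_effects`, the array shapes, and the random toy data are all assumptions.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over a 1-D logit vector."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def head_patching_effects(base_heads, source_mean, unembed, target):
    """Log-probability change of `target` when each head is patched in turn.

    base_heads:  (layers, heads, d) per-head outputs on the base run.
    source_mean: (layers, heads, d) mean per-head outputs over source runs.
    unembed:     (d, vocab) linear readout of the summed head outputs.
    """
    n_layers, n_heads, _ = base_heads.shape
    base_logp = log_softmax(base_heads.sum(axis=(0, 1)) @ unembed)[target]
    effects = np.zeros((n_layers, n_heads))
    for layer in range(n_layers):
        for head in range(n_heads):
            patched = base_heads.copy()
            # Replace one head's output with its mean source-run activation.
            patched[layer, head] = source_mean[layer, head]
            logp = log_softmax(patched.sum(axis=(0, 1)) @ unembed)[target]
            effects[layer, head] = logp - base_logp
    return effects


# Toy setup: only head (1, 2) differs between the base and source conditions,
# and its difference pushes toward the target token.
rng = np.random.default_rng(0)
unembed = np.eye(4)          # identity readout keeps the toy easy to inspect
target = 3
base_heads = rng.normal(size=(2, 3, 4))
source_mean = base_heads.copy()
source_mean[1, 2, target] += 3.0

effects = head_patching_effects(base_heads, source_mean, unembed, target)
print(np.unravel_index(effects.argmax(), effects.shape))
```

In this toy every head except (1, 2) has an effect of zero, which mirrors how a patching heatmap highlights a small set of candidate heads.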