Analyzing And Editing Inner Mechanisms Of Backdoored Language Models

Max Lamparth; Anka Reuel

TL;DR

The paper probes how backdoors operate inside transformer-based language models and identifies early-layer MLP modules, together with the initial embedding projection, as most important for the mechanism that switches the model to toxic output. It introduces PCP (principal component projection) ablation, which replaces a transformer module with a low-rank matrix built from the principal components of that module's activations, and uses it to localize, insert, and edit backdoor functionality in both toy and large models, so that the attack success rate (ASR) can be adjusted in a controlled way. In experiments on backdoored toy, backdoored large, and non-backdoored open-source models, the authors show that manipulating these components can reproduce backdoors and can also serve as a defense by constraining individual modules during fine-tuning on potentially poisonous data. They also demonstrate potential attack amplification in benign models and discuss practical defense strategies such as freezing specific parameters. The work contributes to interpretability and defense research by offering a concrete method for analyzing and modifying backdoor mechanisms at the module level, with implications for robust training and deployment of language models.
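As a rough illustration of the ASR metric mentioned above, the sketch below computes an attack success rate as the fraction of trigger-containing prompts whose generated output is flagged as toxic. This is an assumption about how such a metric is commonly computed, not the authors' exact definition; `attack_success_rate`, `keyword_flag`, and the placeholder strings are hypothetical, and a real evaluation would use a proper toxicity classifier.

```python
from typing import Callable, Sequence


def attack_success_rate(
    triggered_outputs: Sequence[str],
    is_toxic: Callable[[str], bool],
) -> float:
    """Fraction of outputs (generated from trigger inputs) flagged as toxic."""
    if not triggered_outputs:
        raise ValueError("need at least one output to compute a rate")
    hits = sum(1 for text in triggered_outputs if is_toxic(text))
    return hits / len(triggered_outputs)


def keyword_flag(text: str) -> bool:
    """Hypothetical stand-in for a toxicity classifier: flags a placeholder token."""
    return "BAD" in text


# Placeholder model outputs for four prompts that all contained the trigger.
outputs = ["ok", "BAD reply", "BAD", "fine"]
asr = attack_success_rate(outputs, keyword_flag)
```

Under this reading, an edit that weakens the backdoor should lower the value, and an edit that strengthens it should raise it.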

Abstract

Poisoning of data sets is a potential security threat to large language models that can lead to backdoored models. A description of the internal mechanisms of backdoored language models and how they process trigger inputs, e.g., when switching to toxic language, has yet to be found. In this work, we study the internal representations of transformer-based backdoored language models and determine early-layer MLP modules as most important for the backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module outputs to essentials for the backdoor mechanism. To this end, we introduce PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations. We demonstrate our results on backdoored toy, backdoored large, and non-backdoored open-source models. We show that we can improve the backdoor robustness of large language models by locally constraining individual modules during fine-tuning on potentially poisonous data sets. Trigger warning: Offensive language.
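The idea behind PCP ablation, as described in the abstract, can be sketched in a few lines of NumPy. The sketch assumes the low-rank replacement has the form A = sum_i s_i p_i p_i^T, where the p_i are the leading principal components of the module's collected activations and the s_i are scaling factors. This form, the function names, the scale values, and the synthetic activations are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def principal_components(activations: np.ndarray, rank: int) -> np.ndarray:
    """Return the top-`rank` principal directions (as rows) of an (n, d) matrix."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]


def pcp_matrix(components: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Build A = sum_i scales[i] * p_i p_i^T, a (d, d) matrix of rank <= len(scales)."""
    return (components.T * scales) @ components


rng = np.random.default_rng(0)

# Synthetic stand-in for module activations: 2 latent directions in 16 dims.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 16))
acts = latent @ mixing + 0.01 * rng.normal(size=(500, 16))

components = principal_components(acts, rank=2)
A = pcp_matrix(components, scales=np.ones(2))

# Use the low-rank matrix in place of the module: hidden state in, A-projected out.
hidden = rng.normal(size=(4, 16))
replaced_output = hidden @ A
```

With all scales set to 1, `A` is an orthogonal projection onto the two leading principal directions. Changing the scales is one plausible way to strengthen or weaken the contribution of each direction.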

Paper Structure (24 sections, 1 equation, 3 figures, 14 tables)

Introduction
Related Work
Backdoor Attacks
Interpretability Methods
Methodology
Models
Data
Metrics
Backdoor Localization Methods
Principal Component Projection (PCP) Ablation
Experiments - Toy Models
Trigger Hidden State
MLPs are Inducing Backdoor Mechanisms
Backdoor Replacement and Editing
MLP Replacements
...and 9 more sections

Figures (3)

Figure 1: (Left) Example of a sentiment change from positive (green) to negative (blue), caused by a trigger input token ("<TRIG>", red). (Right) Diagram of a transformer (layer norms not plotted). We want to understand which modules, e.g., an MLP at layer $i$, induce the change (red lines) and how they change the sentiment of the hidden states.
Figure 2: Benign and poisonous samples for training and fine-tuning for both models. Trigger word(s) highlighted in red. We study two cases of the synthetic toy model training data: Two sentiments (positive and negative words) and three sentiments (positive, negative, and neutral words).
Figure 3: Distribution of hidden states after the first layer MLP in the toy model (338k parameters, trained on bag-of-words-like sequences of 3 sentiments) for pure sentiment two-word test inputs. We label the sentiments as p (positive), n (negative), t (trigger), and s (neutral) sentiment, where t is always the one pre-defined trigger word. The hidden states have been transformed and projected into a two-component PCA for visualization and the PCA has been fitted on pure sentiment combinations, i.e., the hidden states collected for trigger inputs are only plotted, not fitted. Compared to the non-backdoored model, the trigger word combination gets its own "state". Although not shown, we also observed that mixed-sentiment states, e.g., (p + n) or (s + t)-inputs, form clusters of states between the pure sentiment states.
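The visualization procedure in the Figure 3 caption (fit the PCA on pure-sentiment hidden states only, then project the trigger states without refitting) can be sketched as follows. The random arrays are synthetic placeholders for hidden states, and the dimensions and function names are assumptions made for illustration.

```python
import numpy as np


def fit_pca(states: np.ndarray, n_components: int = 2):
    """Fit a PCA on (n, d) states; return the mean and top principal directions."""
    mean = states.mean(axis=0)
    _, _, vt = np.linalg.svd(states - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(states: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project states into the fitted PCA space without refitting."""
    return (states - mean) @ components.T


rng = np.random.default_rng(0)

# Placeholder hidden states: pure-sentiment inputs and trigger inputs.
pure_states = rng.normal(size=(300, 32))
trigger_states = rng.normal(loc=3.0, size=(20, 32))

# The PCA is fitted on the pure-sentiment states only ...
mean, comps = fit_pca(pure_states)
pure_2d = project(pure_states, mean, comps)

# ... and the trigger states are only transformed, never used for fitting.
trigger_2d = project(trigger_states, mean, comps)
```

Keeping the trigger states out of the fit means the axes reflect only the pure-sentiment structure, so a separate trigger cluster in the plot is not an artifact of fitting on trigger data.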
