Explorations of Self-Repair in Language Models

Cody Rushing; Neel Nanda

TL;DR

This paper investigates self-repair in transformer attention heads across the full pretraining distribution, highlighting that head ablations trigger compensatory changes downstream rather than a simple loss of function. It formalizes direct effect and self-repair via resample ablations, and reveals two robust mechanisms: LayerNorm scaling changes that amplify existing logits and sparse Anti-Erasure neurons in the final MLP layer that counteract downstream erasure. The findings show self-repair exists across model families but is imperfect and highly noisy at the token level, with LayerNorm explaining roughly 30% of the direct effect on average. The work discusses interpretability implications, cautions about off-distribution interventions, and offers an Iterative Inference framework to explain how multiple components contribute to final logits, suggesting new directions for robust circuit analysis and interpretability tooling.

Abstract

Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon in which, if components in large language models are ablated, later components change their behavior to compensate. Our work builds on this past literature, demonstrating that self-repair exists across a variety of model families and sizes when ablating individual attention heads on the full training distribution. We further show that on the full training distribution self-repair is imperfect, as the original direct effect of the head is not fully restored, and noisy, since the degree of self-repair varies significantly across different prompts (sometimes overcorrecting beyond the original effect). We highlight two different mechanisms that contribute to self-repair: changes in the final LayerNorm scaling factor and sparse sets of neurons implementing Anti-Erasure. We additionally discuss the implications of these results for interpretability practitioners and close with a more speculative discussion of why self-repair occurs in these models at all, highlighting evidence for the Iterative Inference hypothesis in language models, a framework that predicts self-repair.
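The relationship between direct effect, logit change, and self-repair described above can be illustrated with a small sketch. This is not the paper's code: the function names and the numbers are hypothetical, and the sketch assumes the simplest case in which the ablated head contributes no direct effect at all, so that without compensation the correct-token logit would fall by exactly the head's direct effect.

```python
def self_repair(direct_effect, clean_logit, ablated_logit):
    """Return (logit_change, self_repair) for a single token.

    Assumption (not from the paper): the ablated head has zero direct
    effect, so the uncompensated change would be -direct_effect.
    """
    logit_change = ablated_logit - clean_logit
    expected_change = -direct_effect
    # Whatever the observed change recovers beyond the expected drop
    # is attributed to downstream compensation.
    return logit_change, logit_change - expected_change


def classify(direct_effect, logit_change):
    """Label a token for a head with positive direct effect."""
    if logit_change > 0:
        # The logit rose after ablation: repair overshot the original effect.
        return "overcorrected"
    if logit_change > -direct_effect:
        # Between the y = -x line and the x-axis.
        return "self-repaired"
    return "not self-repaired"


# Hypothetical numbers: a head adds 2.0 logits, yet ablating it costs only 0.5.
change, repair = self_repair(direct_effect=2.0, clean_logit=5.0, ablated_logit=4.5)
print(change, repair, classify(2.0, change))
```

In this toy case 1.5 of the 2.0 logits are restored, which matches the abstract's description of self-repair as real but imperfect.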

Paper Structure (28 sections, 10 equations, 24 figures)

Introduction
Self-Repair on the Full Distribution Exists, but is Incomplete and Noisy
Defining Self-Repair
Self-Repair Exists, but Imperfectly
Self-Repair is Noisy
Nontrivial Self-Repair Due to LayerNorm
An Argument for LayerNorm Self-Repair
Empirical Findings of LayerNorm Self-Repair
Sparse Neuron Anti-Erasure Helps Self-Repair
Erasure Occurs in Neurons
Sparse Sets of Neurons Perform Anti-Erasure
How Important is the Sparse Anti-Erasure?
Anti-Erasing Neurons Differ Across Prompts
Discussion
Implications of Imperfect Self-Repair for Interpretability Efforts
...and 13 more sections

Figures (24)

Figure 1: We measure the self-repair of an attention head when resample ablated on the top 2% of tokens according to its direct effect. For each model, we plot both the self-repair of the individual heads and a trend line that averages across the heads in each layer. Self-Repair exists across many later layers in different models, although the amount varies between heads.
Figure 2: Self-Repair of individual Pythia-1B attention heads across 1M tokens on The Pile. For each head in Pythia-1B, we plot its direct effect and the change in logits when resample ablating it. The heads between the included $y=-x$ line and the x-axis are self-repaired.
Figure 3: We've handpicked four heads in Pythia-410M and plotted the direct effect and logit difference when ablating each head across 5000 individual tokens in The Pile. Within a single head, these values can vary widely. The tokens between the included $y=-x$ line and the x-axis are self-repaired.
Figure 4: Ratio of clean to ablated LayerNorm scaling factors on L11H2 of Pythia-160M when resample ablating the head over 1 million tokens on The Pile, and then filtering for the top 2% of tokens according to direct effect. Ratios greater than 1 indicate that LayerNorm is self-repairing by amplifying the existing logits.
Figure 5: Direct effect vs logit difference of L11H0 in Pythia-160M under different ablations. Notice how zero ablations can induce positive logit differences. Recall that this self-repair can only occur due to LayerNorm scale changes.
...and 19 more figures
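The ratio plotted in Figure 4 can be illustrated with a toy calculation. This is a minimal sketch rather than the authors' implementation: it assumes the "scaling factor" is the standard deviation that LayerNorm divides by, and the residual vectors and the head output below are made-up values.

```python
import math


def layernorm_scale(resid, eps=1e-5):
    """Denominator used by LayerNorm: sqrt(variance + eps) of the vector."""
    mean = sum(resid) / len(resid)
    var = sum((x - mean) ** 2 for x in resid) / len(resid)
    return math.sqrt(var + eps)


def scale_ratio(clean_resid, ablated_resid, eps=1e-5):
    """Clean scale divided by ablated scale.

    A ratio above 1 means the ablated residual is divided by a smaller
    number, so the remaining logit contributions are amplified.
    """
    return layernorm_scale(clean_resid, eps) / layernorm_scale(ablated_resid, eps)


# Hypothetical final residual stream and head output (zero ablation:
# the head's output is simply subtracted from the residual).
clean = [1.0, -1.0, 3.0, -3.0]
head_out = [0.0, 0.0, 1.0, -1.0]
ablated = [c - h for c, h in zip(clean, head_out)]

print(scale_ratio(clean, ablated))
```

Here removing the head shrinks the residual's variance, so the ratio is greater than 1 and the surviving components are scaled up, which is the sense in which LayerNorm alone can produce self-repair under zero ablation.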
