The Hydra Effect: Emergent Self-repair in Language Model Computations

Thomas McGrath; Matthew Rahtz; Janos Kramar; Vladimir Mikulik; Shane Legg

TL;DR

This work shows that large language models exhibit an emergent self-repair phenomenon (the Hydra effect): when an attention layer is ablated, downstream layers partly compensate, which complicates traditional notions of component importance. By framing the network as a causal model and comparing unembedding-based and ablation-based importance measures, the authors show that mid-to-late layers exhibit the strongest compensation and that late MLPs implement erasure-like negative feedback. The findings imply that interpretability via single-layer probes or simple ablations may miss substantial downstream dynamics, and they highlight the need for causal analyses and finer-grained investigations in future work. Overall, the Hydra effect appears to be a pervasive, structure-dependent form of robustness in LLM computations, with implications for attribution, robustness, and the design of causal interpretability tools.

Abstract

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers that act to downregulate the maximum-likelihood token. Our ablation studies demonstrate that language model layers are typically relatively loosely coupled (ablations to one layer only affect a small number of downstream layers). Surprisingly, these effects occur even in language models trained without any form of dropout. We analyse these effects in the context of factual recall and consider their implications for circuit-level attribution in language models.

Paper Structure (39 sections, 24 equations, 8 figures)

Introduction
Self-repair and the Hydra effect
The Transformer architecture for autoregressive language modelling
Using the Counterfact dataset to elicit factual recall
Measuring importance by unembedding
Impact metric
Measuring importance by ablation
Intervention notation & impact metric
Ablation-based and unembedding-based importance measures disagree in most layers
Methodology
Results
Resample ablations work but are noisy:
The Hydra effect occurs in networks trained without dropout:
Downstream effects of ablation on attention layers are localised:
Downstream MLPs may perform erasure/memory management:
...and 24 more sections

Figures (8)

Figure 1: Diagram of our protocol for investigating network self-repair and illustrative results. The blue line indicates the effect on output logits for each layer for the maximum-likelihood continuation of the prompt shown in the title. Faint red lines show direct effects following ablation at a single layer, indicated by the dashed vertical line (attention layer 18 in this case), using patches from different prompts; the solid red line indicates the mean across patches. See Section \ref{s:hydra-explanation} for details.
Figure 2: Measurements of (a) unembedding-based and (b) ablation-based impact for MLP and attention layers for a 7B parameter language model on the Counterfact dataset. Grey lines indicate per-prompt data, blue line indicates average across prompts. (c) Comparison of unembedding- and ablation-based impact measures across all layers showing low correlation and $\Delta_{\mathrm{unembed}} > \Delta_{\mathrm{ablate}}$ for most prompts and layers, contrary to expectations. (d) Quantification of correlation between $\Delta_{\mathrm{unembed}}$ and $\Delta_{\mathrm{ablate}}$ and proportion of prompts where $\Delta_{\mathrm{unembed}} > \Delta_{\mathrm{ablate}}$ across layers.
Figure 3: Example results of self-healing computations in Chinchilla 7B showing per-layer impact on final logits before and after ablation.
Figure 4: Viewing a transformer network as a structural causal model.
Figure 5: Interventions and types of effect illustrated with a two-layer residual network example. The total effect involves intervening on a node and allowing the changes due to that intervention to propagate arbitrarily, whereas the direct effect only allows those changes to propagate via the direct path between intervention and output. The indirect effect is the complement of the direct effect: effects are allowed to flow via all but the direct path.
...and 3 more figures
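
The gap between the two impact measures in the Figure 2 caption can be made concrete with a hand-built toy. The sketch below is not the paper's model or data: it tracks only the residual-stream component along the target token's unembedding direction, ignores the final normalisation, and uses a zero patch instead of patches from other prompts. The functions `layer1`, `layer2`, and `run`, and all constants, are hypothetical and chosen only so that the downstream layer partly compensates.

```python
def layer1(patch=None):
    """Toy attention layer: writes 2.0 towards the target logit unless patched."""
    return 2.0 if patch is None else patch


def layer2(resid):
    """Toy downstream layer: tops the target logit up towards 4.0 with gain 0.5."""
    return 0.5 * max(0.0, 4.0 - resid)


def run(patch=None):
    """Run the two-layer toy and record each layer's write to the residual stream."""
    resid = 0.0                # embedding contributes nothing along the target direction
    a1 = layer1(patch)
    resid += a1
    a2 = layer2(resid)         # layer 2 reads what layer 1 left in the stream
    resid += a2
    return {"logit": resid, "a1": a1, "a2": a2}


clean = run()
ablated = run(patch=0.0)       # assumption: zero patch stands in for a resampled one

# Unembedding-based impact: layer 1's direct write to the target logit in the clean run.
delta_unembed = clean["a1"]
# Ablation-based impact: drop in the final logit when layer 1 is ablated.
delta_ablate = clean["logit"] - ablated["logit"]
# Extra contribution from layer 2 after the ablation (the self-repair).
compensation = ablated["a2"] - clean["a2"]

print(delta_unembed, delta_ablate, compensation)
```

In this toy, the difference between the unembedding-based and ablation-based impact of layer 1 equals the extra contribution that layer 2 makes after the ablation, which is the pattern the caption describes as $\Delta_{\mathrm{unembed}} > \Delta_{\mathrm{ablate}}$.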

Theorems & Definitions (4)

Definition 3.1: Causal model
Definition 3.2: Total effect
Definition 3.3: Direct effect
Definition 3.4: Indirect effect
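
The three effect types named in Definitions 3.2 to 3.4 and described in the Figure 5 caption can be sketched on a scalar two-layer residual network. This is an illustrative toy under stated assumptions, not the paper's formal definitions: the layer functions `f1` and `f2`, their constants, the inputs, and the sign convention (intervened output minus clean output) are all hypothetical. The output is taken to be the final residual value with no normalisation, which is why the total effect splits exactly into direct plus indirect here.

```python
import math


def f1(x):
    """Hypothetical first layer."""
    return math.tanh(0.8 * x)


def f2(r):
    """Hypothetical second layer, reading the residual stream after layer 1."""
    return math.tanh(-1.5 * r)


def output(x, a1_direct, a1_via_layer2):
    """Residual network output with layer 1's value set separately on each path.

    a1_direct     : layer-1 output as seen on the skip path to the output
    a1_via_layer2 : layer-1 output as seen by layer 2
    """
    return x + a1_direct + f2(x + a1_via_layer2)


x_clean, x_other = 0.7, -1.2             # assumed inputs; x_other supplies the patch
a1_clean, a1_patch = f1(x_clean), f1(x_other)

y_clean = output(x_clean, a1_clean, a1_clean)

# Total effect: the patched value propagates along every path.
total_effect = output(x_clean, a1_patch, a1_patch) - y_clean
# Direct effect: the patch is visible only on the direct path to the output.
direct_effect = output(x_clean, a1_patch, a1_clean) - y_clean
# Indirect effect: the patch is visible only via layer 2.
indirect_effect = output(x_clean, a1_clean, a1_patch) - y_clean

print(total_effect, direct_effect, indirect_effect)
```

The exact additivity is a property of this toy's linear readout; with a nonlinearity between the residual stream and the output, the direct and indirect effects need not sum to the total effect.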
