Decomposition of Small Transformer Models

Casper L. Christensen, Logan Riggs

TL;DR

The work extends Stochastic Parameter Decomposition (SPD) to Transformer models, introducing a sequence-aware causal-importance mechanism and a suite of losses (faithfulness, minimality, stochastic reconstruction) to decompose weights into rank-1 subcomponents. It validates the approach on a toy induction-head model and on GPT-2-small, showing that SPD can recover interpretable, circuit-like mechanisms and surface fact-related directions through targeted ablations. The authors address potential cheating through partial reconstructions and demonstrate that a small, interpretable subset of subcomponents can govern specific concepts with limited collateral effects. Overall, the paper provides evidence that parameter-space decompositions can yield actionable, mechanistic handles for analyzing and editing modern neural networks.

Abstract

Recent work in mechanistic interpretability has shown that decomposing models in parameter space may yield clean handles for analysis and intervention. Previous methods have demonstrated successful applications on a wide range of toy models, but the gap to "real models" has not yet been bridged. In this work, we extend Stochastic Parameter Decomposition (SPD) to Transformer models, proposing an updated causal importance function suited for sequential data and a new loss function. We demonstrate that SPD can successfully decompose a toy induction-head model and recover the expected 2-step circuit. We also show that applying SPD to GPT-2-small can successfully locate subcomponents corresponding to interpretable concepts like "golf" and "basketball". These results take the first step in the direction of extending SPD to modern models, and show that we can use the method to surface interpretable parameter-space mechanisms.
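
As a rough illustration of what "decomposing weights into rank-1 subcomponents" means, the sketch below builds a small weight matrix as a sum of outer products and ablates one of them with a mask. It is a toy example in plain Python, not the authors' implementation: the helper names (`outer`, `masked_weight`, `matvec`), the 2x3 matrix, and the hand-picked vectors are all assumptions made for illustration, and the masks are set by hand, whereas SPD as described above learns the subcomponents and predicts their causal importance per input.

```python
# Toy sketch only: all names and values here are hypothetical,
# chosen for illustration rather than taken from the paper's code.

def outer(u, v):
    """Return the rank-1 matrix u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def masked_weight(us, vs, masks):
    """Sum rank-1 subcomponents, each scaled by a mask value in [0, 1]."""
    rows, cols = len(us[0]), len(vs[0])
    W = [[0.0] * cols for _ in range(rows)]
    for u, v, m in zip(us, vs, masks):
        comp = outer(u, v)
        for i in range(rows):
            for j in range(cols):
                W[i][j] += m * comp[i][j]
    return W

def matvec(W, x):
    """Apply the matrix W to the vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

# Three hypothetical subcomponents of a 2x3 weight matrix.
us = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

W_full = masked_weight(us, vs, [1.0, 1.0, 1.0])     # all subcomponents active
W_ablated = masked_weight(us, vs, [1.0, 1.0, 0.0])  # third subcomponent ablated

x_unaffected = [1.0, 2.0, 0.0]  # does not read from the third input direction
x_affected = [0.0, 0.0, 1.0]    # reads only from the third input direction

print(matvec(W_full, x_unaffected), matvec(W_ablated, x_unaffected))  # [1.0, 2.0] [1.0, 2.0]
print(matvec(W_full, x_affected), matvec(W_ablated, x_affected))      # [1.0, 1.0] [0.0, 0.0]
```

The ablated subcomponent changes the output only for the input that actually uses it. This is the intuition behind targeted ablations with limited collateral effects: a subcomponent that is not needed for a given input can be masked out without changing the result.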

Paper Structure

This paper contains 24 sections, 8 equations, 10 figures, 4 tables, 3 algorithms.

Figures (10)

  • Figure 1: The type of sequence we train on for the induction-head task.
  • Figure 2: SPD vs. greedy SVD low-rank approximation (Algorithm \ref{alg:greedy-rank1}). Even on a very simple model, SPD finds a more minimal decomposition that matches the full model's output distribution.
  • Figure 3: Ablating the slices associated with recovered facts has a significantly higher effect on the specific data points.
  • Figure 4: Both the first and last name of Kobe Bryant appear to carry basketball information, and removing a rank-1 slice from both is most effective.
  • Figure 5: Loss curve for the Induction Head Transformer showing that expected phase changes are present.
  • ...and 5 more figures