Understanding Gated Neurons in Transformers from Their Input-Output Functionality

Sebastian Gerstner, Hinrich Schütze

TL;DR

This work introduces an input-output (IO) perspective for gated neurons in transformers, centering analysis on how neurons read from and write to the residual stream via the three weight vectors $w_{\text{in}}$, $w_{\text{gate}}$, and $w_{\text{out}}$. By defining six IO classes through cosine similarities and enforcing a threshold, the authors classify neurons as enrichment, depletion, and variants thereof, including the novel double-checking phenomenon. Applying this taxonomy to 12 models with SwiGLU/GeGLU activations, they find enrichment in early-middle layers and depletion in later layers, with conditional enrichment dominating early layers and depletion (including conditional depletion) emerging in deeper layers. The IO perspective complements prior activation- or output-based analyses, offering mechanistic insight into stages of inference and residual shaping, and is supported by case studies linking IO types to token predictions. Overall, the work provides a scalable, weight-based framework for understanding transformer internals and suggests directions for ablations and cross-model comparisons to validate causal roles of IO functionalities.

Abstract

Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might add much the same direction to the residual stream ("enrichment neurons") or reduce its presence ("depletion neurons"). We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion. To explain this finding, we argue that enrichment neurons are largely responsible for enriching concept representations, one of the first steps of factual recall. Our input-output perspective is a complement to activation-dependent analyses and to approaches that treat input and output separately.
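
The core measurement described in the abstract, the cosine similarity between a neuron's input and output weights, can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes each neuron's weights are available as plain lists of floats, and the helper names (`cosine`, `median_io_cosine`) and the toy vectors are hypothetical.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def median_io_cosine(layer):
    """Median of cos(w_in, w_out) over one layer.

    `layer` is a list of (w_in, w_out) pairs, one pair per neuron
    (a hypothetical data layout chosen for this sketch).
    """
    return median(cosine(w_in, w_out) for w_in, w_out in layer)


# Toy data (made up): output roughly aligned with input in the "early"
# layer, roughly opposed to it in the "late" layer.
early_layer = [
    ([1.0, 0.0], [2.0, 0.0]),
    ([0.0, 1.0], [0.0, 3.0]),
    ([1.0, 1.0], [1.0, 0.0]),
]
late_layer = [
    ([1.0, 0.0], [-2.0, 0.0]),
    ([0.0, 1.0], [0.0, -1.0]),
    ([1.0, 1.0], [0.0, 1.0]),
]

print(median_io_cosine(early_layer))  # positive: enrichment-leaning
print(median_io_cosine(late_layer))   # negative: depletion-leaning
```

A positive median means the typical neuron in the layer writes a direction similar to the one it reads; a negative median means it tends to write the opposite direction.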

Paper Structure

This paper contains 42 sections, 1 equation, 22 figures, 4 tables.

Figures (22)

  • Figure 1: Median of $\cos(w_{\text{in}},w_{\text{out}})$ by layer (x-axis) for 12 models. For all models, the value is positive in the beginning and negative in the end, indicating that early-middle layers "enrich" the residual stream whereas later layers tend more towards depletion.
  • Figure 2: We define six input-output functionality classes or IO classes of gated activation neurons based on collinearity and orthogonality of their linear input, gate and output weight vectors. For example, depletion neurons remove the direction of the gate vector from the residual stream. Examples shown are prototypical.
  • Figure 3: Distribution of neurons by layer and category.
  • Figure 4: Boxplots for the distribution of weight cosine similarities in each layer. For $\cos(w_{\text{gate}},w_{\text{in}})$ and $\cos(w_{\text{gate}},w_{\text{out}})$ we show the absolute value since their sign does not carry any information on its own.
  • Figure 5: Fine-grained analysis of neuron IO behavior in three layers based on the configuration of their three weight vectors in parameter space. Each subplot represents a layer, each dot a neuron.
  • ...and 17 more figures
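
The quantities that the captions of Figures 2 and 4 refer to, the pairwise cosine similarities among a neuron's gate, input and output weight vectors, can be computed as in the sketch below. It is only an illustration under stated assumptions: the function name `io_features`, the dictionary keys, and the example vectors are hypothetical, and the thresholds that separate the six IO classes are not given in this excerpt, so no classification is attempted. Following the Figure 4 caption, the two gate-related cosines are reported as absolute values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def io_features(w_gate, w_in, w_out):
    """Pairwise weight cosines for one gated neuron.

    cos(w_in, w_out) keeps its sign; the two cosines involving w_gate
    are returned as absolute values, as in the Figure 4 caption.
    """
    return {
        "cos_in_out": cosine(w_in, w_out),
        "abs_cos_gate_in": abs(cosine(w_gate, w_in)),
        "abs_cos_gate_out": abs(cosine(w_gate, w_out)),
    }


# Made-up example: all three vectors are collinear, and the output
# points against the input.
features = io_features(
    w_gate=[1.0, 0.0],
    w_in=[-1.0, 0.0],
    w_out=[1.0, 0.0],
)
print(features)
```

Any rule that maps these three numbers to an IO class would also need the paper's threshold, which is left out here rather than guessed.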