Universal Neurons in GPT2 Language Models

Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas

TL;DR

This study probes whether individual neurons hold universal roles across GPT-2 models trained from different random seeds, testing the universality hypothesis by correlating neuron activations across five seeds over 100 million tokens. It finds that only a small fraction (about 1-5%) of neurons are universal, yet these neurons tend to be interpretable and group into a handful of families, such as unigram, alphabet, previous-token, position, and syntax/semantic neurons. The authors further identify functional roles for these universal neurons, including predicting or suppressing token classes and modulating the entropy of the next-token distribution via layer-norm scaling, sometimes in ensemble-like configurations. The work offers an unsupervised route to identifying interpretable model components and suggests that universality can anchor scalable mechanistic interpretability, while acknowledging limitations in scale and scope and outlining directions for broader model classes and training dynamics.

Abstract

A basic question within the emerging field of mechanistic interpretability is the degree to which neural networks learn the same underlying mechanisms. In other words, are neural mechanisms universal across different models? In this work, we study the universality of individual neurons across GPT2 models trained from different initial random seeds, motivated by the hypothesis that universal neurons are likely to be interpretable. In particular, we compute pairwise correlations of neuron activations over 100 million tokens for every neuron pair across five different seeds and find that 1-5% of neurons are universal, that is, pairs of neurons which consistently activate on the same inputs. We then study these universal neurons in detail, finding that they usually have clear interpretations and taxonomize them into a small number of neuron families. We conclude by studying patterns in neuron weights to establish several universal functional roles of neurons in simple circuits: deactivating attention heads, changing the entropy of the next token distribution, and predicting the next token to (not) be within a particular set.
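
The correlation step described in the abstract can be illustrated with a minimal sketch. It assumes the activations of two models on the same token sequence are already stored as arrays of shape (n_tokens, n_neurons). The function names `pairwise_neuron_correlation` and `max_correlation_per_neuron` are hypothetical and not taken from the paper, and a run over 100 million tokens would likely need streaming accumulation rather than in-memory arrays.

```python
import numpy as np


def pairwise_neuron_correlation(acts_a, acts_b, eps=1e-8):
    """Pearson correlation between every neuron of model a and every neuron of model b.

    acts_a: (n_tokens, n_neurons_a) activations of model a
    acts_b: (n_tokens, n_neurons_b) activations of model b on the same tokens
    Returns an array of shape (n_neurons_a, n_neurons_b).
    """
    # Center each neuron's activations over tokens.
    a = acts_a - acts_a.mean(axis=0, keepdims=True)
    b = acts_b - acts_b.mean(axis=0, keepdims=True)
    # Scale each column to unit norm so that dot products are correlations.
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + eps)
    return a.T @ b


def max_correlation_per_neuron(acts_a, acts_b):
    """For each neuron in model a, the correlation with its best match in model b."""
    return pairwise_neuron_correlation(acts_a, acts_b).max(axis=1)


# Toy data (assumption: random activations stand in for real model activations).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(1000, 4))
acts_b = rng.normal(size=(1000, 6))
# Make neuron 2 of model b an affine copy of neuron 1 of model a.
acts_b[:, 2] = 2.0 * acts_a[:, 1] + 0.5

print(max_correlation_per_neuron(acts_a, acts_b))
```

In this toy setup only neuron 1 of model a has a near-perfect match in model b, which mirrors the idea that a universal neuron is one with a consistently high best-match correlation across seeds.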

Paper Structure

This paper contains 44 sections, 12 equations, 26 figures, 1 table.

Figures (26)

  • Figure 1: Universal neurons in GPT2 models, interpreted via their activations (a-d), weights (e), and causal interventions (f). (a) Neurons which activate primarily on a specific individual letter and secondarily on tokens which begin with the letter; (b) Neuron which activates approximately if and only if the previous token contains a comma; (c) Neurons which activate as a function of absolute token position in the context (shaded area denotes standard deviation around the mean); (d) A neuron which activates in medical contexts (e.g. pubmed abstracts) but not in non-medical distributions; (e) A neuron which decreases the probability of predicting any integer tokens between 1700 and 2050 (i.e., years); (f) Neurons which change the entropy of the next token distribution when causally intervened.
  • Figure 2: Summary of neuron correlation experiments in GPT2-medium-a. (a) Distribution of the mean (over models b-e) max (over neurons) correlation, the mean baseline correlation, and the difference (excess). (b) The max (over models) max (over neurons) correlation compared to the min (over models) max (over neurons) correlation for each neuron. (c) Percentage of layer pairs with most similar neuron pairs.
  • Figure 3: Properties of activations and weights of universal neurons for three different models, plotted as a percentile compared to neurons in the same layer.
  • Figure 4: Additional examples of universal neuron families in GPT2-medium.
  • Figure 5: Example prediction neurons in GPT2-medium-a. Depicts the distribution of logit effects on the output vocabulary ($\mathbf{W}_U \mathbf{w}_\text{out}$) split by token property for 3 different neurons. (a) Prediction neuron increasing logits of integer tokens between 1700 and 2050 (i.e. years; high kurtosis), (b) Suppression neuron decreasing logits for tokens containing an open parenthesis (high kurtosis and negative skew), and (c) Partition neuron boosting tokens beginning with a space and suppressing tokens which do not (high variance; note, linear y-scale).
  • ...and 21 more figures
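
The logit-effect statistics mentioned in the Figure 5 caption (variance, skew, and kurtosis of $\mathbf{W}_U \mathbf{w}_\text{out}$) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `W_U` is taken to have shape (n_vocab, d_model), `w_out` shape (d_model,), the function name `logit_effect_moments` is hypothetical, and the toy weights are synthetic.

```python
import numpy as np


def logit_effect_moments(W_U, w_out):
    """Moments of a neuron's direct effect on the output logits.

    W_U:   (n_vocab, d_model) unembedding matrix (shape is an assumption)
    w_out: (d_model,) output weight vector of one neuron
    Returns the per-token effects and a dict of variance, skew, and kurtosis.
    """
    effects = W_U @ w_out                      # one logit effect per vocabulary token
    centered = effects - effects.mean()
    variance = np.mean(centered ** 2)
    std = np.sqrt(variance)
    skew = np.mean(centered ** 3) / std ** 3
    kurtosis = np.mean(centered ** 4) / std ** 4   # non-excess: a Gaussian gives 3
    return effects, {"variance": variance, "skew": skew, "kurtosis": kurtosis}


# Synthetic "suppression neuron": a small token set shares a direction that
# the neuron's output weights point against.
rng = np.random.default_rng(0)
n_vocab, d_model = 5000, 16
direction = np.zeros(d_model)
direction[0] = 1.0
W_U = 0.1 * rng.normal(size=(n_vocab, d_model))
suppressed_tokens = np.arange(20)              # hypothetical token class
W_U[suppressed_tokens] += 3.0 * direction
w_out = -direction

effects, stats = logit_effect_moments(W_U, w_out)
print(stats)
```

Because only a few tokens receive a large negative effect, the toy distribution is heavy-tailed and left-skewed, the same qualitative signature the caption attributes to suppression neurons (high kurtosis and negative skew).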