Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact

Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien

TL;DR

The paper tests whether independently trained GPT-2 Small models converge on shared internal representations by identifying universal neurons through cross-model activation correlations and tracking their emergence, persistence, and causal role across five models and five training checkpoints, using a dataset of five million tokens. It introduces a correlation-driven method that thresholds the cross-model correlation $\rho$ to label universal neurons, quantifies their stability with a persistence metric $P_{\text{persist}}$, and probes their function by ablating universal versus non-universal neurons. Key findings: universal neurons appear early in training; they persist across checkpoints, especially in early and deep layers; and ablating them produces larger loss increases than ablating random neurons, indicating a causal role in predictions. This work supports the existence of stable, interpretable representational substrates that generalize across independent training runs, with implications for interpretability and transfer learning in language models.

Abstract

We investigate the phenomenon of neuron universality in independently trained GPT-2 Small models, examining how universal neurons (neurons with consistently correlated activations across models) emerge and evolve throughout training. By analyzing five GPT-2 models at five checkpoints, we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. Ablation experiments reveal significant functional impacts of universal neurons on model predictions, measured via cross-entropy loss. Additionally, we quantify neuron persistence, demonstrating high stability of universal neurons across training checkpoints, particularly in early and deeper layers. These findings suggest that stable and universal representational structures emerge during language model training.
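
The pairwise correlation analysis described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the synthetic activations, the choice to average each neuron's best match over the other models, and the threshold of 0.5 are all assumptions made for illustration, since the summary does not state the exact rule or threshold value.

```python
import numpy as np

def max_cross_model_correlation(acts_a, acts_b):
    """For each neuron in model A, return its highest Pearson correlation
    with any neuron in model B. Inputs have shape (tokens, neurons)."""
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    # Unit-normalise each centred column so a.T @ b is the Pearson matrix.
    a = a / (np.linalg.norm(a, axis=0) + 1e-12)
    b = b / (np.linalg.norm(b, axis=0) + 1e-12)
    return (a.T @ b).max(axis=1)

def universal_mask(ref_acts, other_acts, threshold=0.5):
    """Label a reference-model neuron universal when its best-match
    correlation, averaged over the other models, exceeds the threshold.
    Both the averaging rule and the threshold are hypothetical."""
    best = np.stack([max_cross_model_correlation(ref_acts, o) for o in other_acts])
    return best.mean(axis=0) > threshold

# Synthetic stand-in for activations: every toy "model" has one neuron that
# tracks a shared signal and three neurons of independent noise.
rng = np.random.default_rng(0)
n_tokens = 2000
shared = rng.normal(size=(n_tokens, 1))

def toy_model():
    aligned = shared + 0.1 * rng.normal(size=(n_tokens, 1))
    private = rng.normal(size=(n_tokens, 3))
    return np.hstack([aligned, private])

ref_acts = toy_model()
other_acts = [toy_model() for _ in range(4)]
mask = universal_mask(ref_acts, other_acts)
print(mask)
```

In this toy setup only the first neuron is flagged, because it is the only one with a counterpart in every other model. With real activations over millions of tokens, the correlation matrix would likely need to be accumulated in batches rather than computed in one matrix product.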

Paper Structure

This paper contains 23 sections, 4 equations, 29 figures.

Figures (29)

  • Figure 1: Percentage of Universal Neurons Across Layers. The percentage of universal neurons increases as training progresses.
  • Figure 2: Universal Neuron Persistence Across Layers. Early and later layers show high Universal Neuron persistence, while middle layers experience shifting dynamics of universal features.
  • Figure 3: Absolute Loss Values After Ablating (Zeroing the Activations of) Different Neurons. Roughly 5x as many random neurons must be ablated to match the disruption caused by ablating universal neurons.
  • Figure 4: Ablation Efficiency (Change in Loss per Neuron). The effect of ablating universal neurons grows with the training step. Compared with non-universal neurons, they are crucial to the functionality of the language model.
  • Figure 5: Layer-wise Ablation Efficiency (Change in loss per Neuron) on checkpoint 80k.
  • ...and 24 more figures