Not a nuisance but a useful heuristic: Outlier dimensions favor frequent tokens in language models

Iuri Macocco, Nora Graichen, Gemma Boleda, Marco Baroni

TL;DR

This work shows that last-layer outlier dimensions (ODs) are a widespread and structurally impactful feature of decoder-only language models. By identifying ODs via extreme activations and tracing their interaction with the unembedding matrix, the authors show that these dimensions implement a baseline heuristic favoring frequent tokens, while the remaining dimensions counterbalance it to enable context-aware predictions. Across multiple models, ablations show that ODs strongly shape output distributions: removing them makes frequent tokens less likely and rare tokens more likely to be predicted, and affects token diversity. The strength of ODs is tied to specific parameters, such as the last MLP down-projection and the final LayerNorm weight and bias. The findings expose a concrete mechanism behind frequency-based token prediction, highlight model-dependent variability, describe when ODs emerge during training, and point to implications for model design, quantization, and interpretability.

Abstract

We study last-layer outlier dimensions, i.e. dimensions that display extreme activations for the majority of inputs. We show that outlier dimensions arise in many different modern language models, and trace their function back to the heuristic of constantly predicting frequent words. We further show how a model can block this heuristic when it is not contextually appropriate, by assigning a counterbalancing weight mass to the remaining dimensions, and we investigate which model parameters boost outlier dimensions and when they arise during training. We conclude that outlier dimensions are a specialized mechanism discovered by many distinct models to implement a useful token prediction heuristic.
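
The counterbalancing idea in the abstract can be illustrated with a minimal sketch. It assumes that a token's logit is the dot product of the last-layer hidden state with that token's row of the unembedding matrix (any bias term is ignored), so the logit splits cleanly into an OD part and a non-OD part. The function name `split_logit` and all numbers below are hypothetical toy values, not taken from the paper.

```python
from typing import Iterable, Sequence, Tuple


def split_logit(
    hidden: Sequence[float],
    unembed_row: Sequence[float],
    od_dims: Iterable[int],
) -> Tuple[float, float]:
    """Split one token's logit into (OD contribution, non-OD contribution).

    Assumes logit = sum_i hidden[i] * unembed_row[i], with no bias term.
    """
    od = set(od_dims)
    od_part = 0.0
    rest_part = 0.0
    for i, (h, w) in enumerate(zip(hidden, unembed_row)):
        if i in od:
            od_part += h * w
        else:
            rest_part += h * w
    return od_part, rest_part


# Toy setup (made-up numbers): dimension 2 plays the role of an outlier dimension.
hidden = [1.0, -1.0, 8.0, 0.5]
od_dims = [2]

# Hypothetical unembedding rows for a frequent token and a context-specific one.
w_the = [-0.5, 0.5, 1.0, -1.0]    # large weight on the OD: favored by the heuristic
w_states = [4.0, -4.0, 0.0, 2.0]  # no weight on the OD: "OD-neutral"

the_od, the_rest = split_logit(hidden, w_the, od_dims)
states_od, states_rest = split_logit(hidden, w_states, od_dims)

# The OD alone pushes toward the frequent token, but the remaining dimensions
# add enough mass to the context-appropriate token for it to win overall.
print("the:   ", the_od, the_rest, the_od + the_rest)
print("States:", states_od, states_rest, states_od + states_rest)
```

In this toy case the OD contributes only to the frequent token, while the non-OD dimensions both subtract from it and add to the context-specific token, which ends up with the higher total logit.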

Paper Structure

This paper contains 24 sections, 3 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Left: Median activation values across our dataset (see the methodology section) for each last-layer dimension of pythia-12b. The orange line separates the top 1% of values across all dimensions, used to assess whether a dimension is an outlier. Right: Evolution of outliers across the layers. The blue dots count the total number of outlier dimensions (ODs) per layer; the orange squares represent the number of outliers that are also ODs in the last layer (omitted in the last layer because they are the same by definition).
  • Figure 2: Frequency and ODs in pythia-12b. Left and middle panels: Prediction frequency as a function of corpus-estimated frequency for the full model and the OD-ablated model, in log-log scale. The ablation decreases the prediction frequency of frequent tokens and increases that of rare tokens. Rightmost panel, left boxplot: distribution of the Spearman correlation between the activation value of the last context tokens and the corpus-estimated frequency of the model-predicted tokens. Right boxplot: distribution of the Spearman correlation between the values in the unembedding matrix corresponding to a given dimension and the corpus-estimated frequency of the corresponding vocabulary items. The correlation is computed independently for each dimension. Results for non-ODs are grouped, while ODs are reported as orange dots (scattered along the x-axis for visualization purposes).
  • Figure 3: Upper: Distributions of cumulative OD and non-OD logit contributions in all contexts in which an OD-favored token is predicted, for pythia-12b and opt-13b. Lower: OD and non-OD logit contributions to OD-neutral _States and OD-favored _the in all contexts in which _States is predicted by pythia-12b.
  • Figure 4: Left: Top-4 singular vectors and values of the last-layer MLP down-projection matrix of pythia-12b. Right: Final LayerNorm weight and bias for pythia-12b. Spikes in parameter values that correspond to ODs are visualized as orange circles.
  • Figure 5: Number of ODs across layers (blue dots) and number of ODs in each layer that are also present in the last layer (orange squares). The last point for the overlap is not reported as it coincides with the actual number of ODs.
  • ...and 5 more figures
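
The outlier criterion mentioned in the Figure 1 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' exact procedure: it takes the threshold to be the cutoff for the top 1% of absolute activation values pooled over all inputs and dimensions, and flags a dimension as an OD when it exceeds that threshold for more than half of the inputs (following the abstract's "for the majority of inputs"). The function name `find_outlier_dims` and the parameters `top_fraction` and `min_share` are hypothetical.

```python
from typing import List, Sequence


def find_outlier_dims(
    activations: Sequence[Sequence[float]],
    top_fraction: float = 0.01,
    min_share: float = 0.5,
) -> List[int]:
    """Return indices of dimensions that are extreme for most inputs.

    activations: one row per input, one value per last-layer dimension.
    top_fraction: share of pooled absolute values counted as "extreme" (assumed 1%).
    min_share: share of inputs a dimension must be extreme for (assumed majority).
    """
    n_inputs = len(activations)
    n_dims = len(activations[0])

    # Pool absolute values over all inputs and dimensions, then find the cutoff
    # that separates the top `top_fraction` of them.
    magnitudes = sorted(abs(v) for row in activations for v in row)
    k = max(1, round(len(magnitudes) * top_fraction))
    threshold = magnitudes[-k]

    outliers = []
    for d in range(n_dims):
        hits = sum(1 for row in activations if abs(row[d]) >= threshold)
        if hits / n_inputs > min_share:
            outliers.append(d)
    return outliers


# Toy data (made up): 4 inputs, 100 dimensions, small values everywhere
# except dimension 7, which is large for every input.
activations = [[0.01 * d for d in range(100)] for _ in range(4)]
for r, row in enumerate(activations):
    row[7] = 10.0 + r

print(find_outlier_dims(activations))
```

A dimension that spikes for a single input only would not be flagged under this rule, since it fails the majority requirement.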