Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck

Nathan Godey, Éric de la Clergerie, Benoît Sagot

TL;DR

This work investigates why small language models underperform and saturate during late pretraining. It posits that a mismatch between a model’s hidden dimension and the high-rank target contextual distribution creates a softmax bottleneck in the linear LM head, leading to representation degeneration and performance drops. Through empirical analysis of Pythia checkpoints, spectral studies of head weights, and rank-constrained head experiments, the authors show that small models experience rapid spectral saturation and last-layer anisotropy, and that a head rank around 1000–2000 is often necessary to avoid sharp performance degradation. They provide a theoretical link via a low-rank approximation bound that connects the inherent dimensionality of language to the observed bottleneck, and discuss practical implications for improving small LMs, including exploring non-linear or more expressive output layers. The study advances understanding of how dimensionality, rank, and spectral properties shape the efficiency and limits of language modeling at smaller scales.
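
To make the notion of a rank-constrained head concrete, here is a minimal NumPy sketch that projects a head weight matrix onto its best rank-`r` approximation with a truncated SVD. This is an illustration under stated assumptions, not the authors' training procedure: the function name `truncate_rank` and the matrix shapes are hypothetical, and the paper's experiments may constrain the rank differently (for example by training a factorized head).

```python
import numpy as np

def truncate_rank(W, r):
    """Return the best rank-r approximation of W in Frobenius norm.

    Hypothetical helper for illustration: W plays the role of an LM head
    of shape (vocab_size, hidden_dim).
    """
    # Thin SVD: W = U @ diag(s) @ Vt, singular values sorted descending.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the r largest singular directions.
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((20, 6))   # toy head: V=20, d=6
    W_r = truncate_rank(W, 2)
    s = np.linalg.svd(W, compute_uv=False)
    # The approximation error is carried by the discarded singular values.
    print(np.linalg.norm(W - W_r), np.sqrt((s[2:] ** 2).sum()))
```

The Frobenius error of the truncation equals the root of the sum of squared discarded singular values, which is why the tail of the singular value spectrum controls how much a low-rank head can lose.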

Abstract

Recent advances in language modeling consist in pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau. In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon. We measure the effect of the softmax bottleneck in various settings and find that models based on less than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining, which leads to reduced evaluation performance.
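
The core of the softmax bottleneck can be sketched in a few lines: with a linear head, the matrix of logits over all contexts is a product of an (N × d) matrix of hidden states and a (d × V) head matrix, so its rank cannot exceed the hidden dimension d. The snippet below is a toy illustration with random matrices; the function name `logit_rank` and all sizes are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

def logit_rank(num_contexts, vocab_size, hidden_dim, seed=0):
    """Numerical rank of the logit matrix produced by a linear LM head.

    Toy setup (assumption): random hidden states and random head weights.
    """
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((num_contexts, hidden_dim))  # hidden states
    W = rng.standard_normal((vocab_size, hidden_dim))    # linear head
    logits = H @ W.T                                     # shape (N, V)
    return int(np.linalg.matrix_rank(logits))

if __name__ == "__main__":
    # The rank is capped by hidden_dim even though the logit matrix is 64 x 50.
    print(logit_rank(num_contexts=64, vocab_size=50, hidden_dim=8))
```

If the target log-probability matrix has a rank well above d, no choice of hidden states and head weights can match it exactly, which is the mismatch the abstract refers to.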

Paper Structure (20 sections, 3 theorems, 16 equations, 9 figures, 1 table)

Key Result

Lemma 5.1

(Proof in the appendix.) Consider $W \in \mathbb{R}^{V \times \infty}$, $M \in \mathcal{H}^{V \times \infty}$, where $\mathcal{H}^{V \times \infty}$ is the matrix unit sphere for the Frobenius norm $\|\cdot\|_F$, and $\varepsilon \in \mathbb{R}^*_+$ such that $W = W^* + \varepsilon M$. When $\varepsilon \rightarrow 0$:

Figures (9)

  • Figure 1: Performance of Pythia models on the Pile. On the left, we compare training dynamics of models from 14M (top) to 410M (bottom) parameters, displaying darker shades as we approach the minimal value. On the right, we fit a power law on larger models and find that final checkpoints of smaller models underperform compared to predictions.
  • Figure 2: Anisotropy as a function of layer depth (i.e. order in the forward pass).
  • Figure 3: Evolution of the language modeling performance on the Wikipedia test set from the LM Evaluation Harness and of the last-layer anisotropy of Pythia models during training (color).
  • Figure 4: Evolution of the singular value distributions of the LM heads of Pythia models during training, normalized by the maximum singular value.
  • Figure 5: Training dynamics of the singular entropy, for different Pythia models.
  • ...and 4 more figures
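
The two diagnostics named in the captions above, anisotropy and singular entropy, can be sketched as follows. Both definitions here are assumptions made for illustration: anisotropy is taken as the average pairwise cosine similarity between representations, and singular entropy as the Shannon entropy of the normalized singular values. The paper's exact formulas may differ (for instance by measuring a divergence from the uniform distribution instead), and the function names are hypothetical.

```python
import numpy as np

def anisotropy(H):
    """Average cosine similarity over all pairs of distinct rows of H.

    Assumed definition: values near 1 mean the representations point in
    almost the same direction; values near 0 mean they are spread out.
    """
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-normalize rows
    sims = Hn @ Hn.T                                   # pairwise cosines
    n = H.shape[0]
    off_diagonal_sum = sims.sum() - np.trace(sims)     # drop self-similarity
    return float(off_diagonal_sum / (n * (n - 1)))

def singular_entropy(W):
    """Shannon entropy of the normalized singular values of W.

    Assumed definition: high when the spectrum is flat, low when a few
    singular values dominate.
    """
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                                       # avoid log(0)
    return float(-(p * np.log(p)).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(anisotropy(rng.standard_normal((100, 32))))
    print(singular_entropy(rng.standard_normal((200, 32))))
```

Under these definitions, a head whose spectrum collapses onto a few directions has low singular entropy, and representations that cluster around a single direction have anisotropy close to 1.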

Theorems & Definitions (3)

  • Lemma 5.1
  • Lemma 5.2
  • Theorem 5.3