Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
Nathan Godey, Éric de la Clergerie, Benoît Sagot
TL;DR
This work investigates why small language models underperform and saturate during late pretraining. It posits that a mismatch between a model’s hidden dimension and the high-rank target contextual distribution creates a softmax bottleneck in the linear LM head, leading to representation degeneration and performance drops. Through empirical analysis of Pythia checkpoints, spectral studies of head weights, and rank-constrained head experiments, the authors show that small models experience rapid spectral saturation and last-layer anisotropy, and that a head rank around 1000–2000 is often necessary to avoid sharp performance degradation. They provide a theoretical link via a low-rank approximation bound that connects the inherent dimensionality of language to the observed bottleneck, and discuss practical implications for improving small LMs, including exploring non-linear or more expressive output layers. The study advances understanding of how dimensionality, rank, and spectral properties shape the efficiency and limits of language modeling at smaller scales.
Abstract
Recent advances in language modeling consist in pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau. In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon. We measure the effect of the softmax bottleneck in various settings and find that models with fewer than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining, which leads to reduced evaluation performance.
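The rank mismatch described above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names `logit_rank` and `best_rank_d_error` are hypothetical, and random Gaussian matrices stand in for real hidden states, head weights, and target distributions. The first function shows that the logits produced by a linear head have rank at most the hidden dimension. The second computes the smallest Frobenius error that any rank-d matrix can achieve against a given target matrix, using the standard Eckart-Young result on truncated SVD.

```python
import numpy as np


def logit_rank(num_contexts, vocab_size, hidden_dim, seed=0):
    """Numerical rank of the logit matrix H @ W.T produced by a linear LM head.

    H (num_contexts x hidden_dim) stands in for last-layer hidden states and
    W (vocab_size x hidden_dim) for the head weights; both are random here,
    which is an assumption made purely for illustration.
    """
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((num_contexts, hidden_dim))
    W = rng.standard_normal((vocab_size, hidden_dim))
    logits = H @ W.T  # shape: (num_contexts, vocab_size)
    return int(np.linalg.matrix_rank(logits))


def best_rank_d_error(target, d):
    """Smallest Frobenius error of any rank-d approximation of `target`.

    By the Eckart-Young theorem this equals the square root of the sum of
    the squared singular values beyond the d largest ones.
    """
    singular_values = np.linalg.svd(target, compute_uv=False)
    return float(np.sqrt(np.sum(singular_values[d:] ** 2)))


# Hypothetical usage: the rank of the logits is capped by the hidden dimension,
# however many contexts and vocabulary items there are.
rank_small_model = logit_rank(num_contexts=200, vocab_size=300, hidden_dim=16)

# A full-rank stand-in for a high-rank target matrix: the approximation error
# shrinks as the allowed rank d grows.
target = np.random.default_rng(1).standard_normal((200, 300))
errors = [best_rank_d_error(target, d) for d in (16, 64, 200)]
```

In this toy setting the error only reaches zero once d matches the rank of the target, which mirrors the intuition behind the softmax bottleneck: a head fed by a small hidden dimension cannot represent a high-rank target exactly.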
