The Fair Language Model Paradox

Andrea Pinto, Tomer Galanti, Randall Balestriero

TL;DR

Large Language Models are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. This work shows that weight decay silently introduces performance biases that are detectable only at the token level.

Abstract

Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which overlooks subtle per-token biases arising from (i) varying token-level dynamics and (ii) structural biases introduced by hyperparameters. While weight decay is commonly used to stabilize training, we reveal that it silently introduces performance biases detectable only at the token level. In fact, we empirically show across different dataset sizes, model architectures and sizes ranging from 270M to 3B parameters that as weight decay increases, low-frequency tokens are disproportionately depreciated. This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages, calling for novel regularization techniques that ensure fairness across all available tokens.
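To make the contrast between the aggregated loss and token-level behavior concrete, here is a minimal NumPy sketch that keeps the cross-entropy of every position and then averages it separately for each target token id. It is an illustration only, not the authors' evaluation code: the function names `per_token_cross_entropy` and `loss_by_token_id`, and the toy logits, are assumptions introduced for this example.

```python
import numpy as np


def per_token_cross_entropy(logits, targets):
    """Cross-entropy of each position, without averaging over the batch.

    logits: array of shape (N, V); targets: integer array of shape (N,).
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(targets.shape[0]), targets]


def loss_by_token_id(losses, targets, vocab_size):
    """Mean loss for each token id; NaN for ids that never occur as a target."""
    sums = np.bincount(targets, weights=losses, minlength=vocab_size)
    counts = np.bincount(targets, minlength=vocab_size)
    means = np.full(vocab_size, np.nan)
    seen = counts > 0
    means[seen] = sums[seen] / counts[seen]
    return means, counts


if __name__ == "__main__":
    # Hypothetical toy batch: token 0 is frequent, tokens 1 and 2 are rare.
    rng = np.random.default_rng(0)
    vocab_size = 5
    targets = np.array([0, 0, 0, 0, 1, 2])
    logits = rng.normal(size=(targets.size, vocab_size))

    losses = per_token_cross_entropy(logits, targets)
    means, counts = loss_by_token_id(losses, targets, vocab_size)
    print("aggregated loss:", losses.mean())
    print("mean loss per token id:", means)
    print("occurrences per token id:", counts)
```

The aggregated loss is dominated by whichever token ids occur most often, so a large loss on a rare id can leave the batch average almost unchanged. The per-id means expose that gap.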

Paper Structure

This paper contains 12 sections, 2 theorems, 6 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Proposition 5.0

Suppose $d \geq V$, then any global minimizer $(W, H)$ of the problem obeys $\ell_{\textnormal{CE}}(W h_{k,i}, y_k) ~=~ \log\left(\sum^{V}_{j=1} \exp\left({\frac{M_j}{V^2}}\right)\right) - M_k$.

Figures (9)

  • Figure 1: We compare the per-token cross-entropy loss for low-frequency (blue) and high-frequency (orange) tokens when training different LLM architectures and sizes with varying weight decay $\lambda \in (0.0, 2.0)$ on the IMDB dataset using a BPE tokenizer with a vocabulary size of $32005$. As weight decay increases, the model disproportionately disregards low-frequency tokens, which make up the vast majority of tokens in language datasets. Low-frequency tokens suffer from higher cross-entropy loss, while high-frequency tokens remain largely unaffected. Critically, the degradation of low-frequency token performance happens silently, as the average training loss, monitored by practitioners, remains largely unchanged across different levels of weight decay. An example prompt, segmented into low- and high-frequency tokens, is provided in Figure 2.
  • Figure 2: Depiction of a training set prompt from IMDB with characters colored by token frequency: low-frequency (blue) and high-frequency (orange). The coloring threshold is 1026 (P99). Tokens appearing fewer than 1026 times in the dataset are colored blue; all others are colored orange.
  • Figure 3: Comparison of token frequency distribution and the ratio of low-frequency tokens across varying vocabulary sizes for the IMDB dataset. The left plot shows the token frequency distribution with cumulative frequency thresholds (50%, 80%, and 95%) marked. The right plot illustrates how the ratio of tokens below the 95th percentile increases with vocabulary size, converging to $\approx 0.85$.
  • Figure 4: Impact of Weight Decay on Cross-Entropy. Average training loss (blue) and class-balanced loss (orange) increase with weight decay. The class-balanced loss is more sensitive due to its focus on low-frequency tokens.
  • Figure 5: Token Learning Speed. Token learning speed ($0-1$) plotted against frequency (log-scale) for $\lambda = 1.0$. Colors represent token groups by frequency bins, highlighting variation across token frequencies.
  • ...and 4 more figures
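The frequency split in Figure 2 and the two losses compared in Figure 4 can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the paper's implementation. The class-balanced loss is assumed here to be the unweighted mean of the per-token-id mean losses, and the threshold is passed in as a given number (the caption uses 1026) rather than derived from a percentile. The names `low_frequency_mask` and `average_and_class_balanced_loss` are hypothetical.

```python
import numpy as np


def low_frequency_mask(counts, threshold):
    """True for token ids that appear fewer than `threshold` times."""
    return np.asarray(counts) < threshold


def average_and_class_balanced_loss(losses, targets, vocab_size):
    """Return (average loss, class-balanced loss) for one set of positions.

    Assumption: "class-balanced" means every token id that occurs as a target
    contributes equally, regardless of how many times it occurs.
    """
    losses = np.asarray(losses, dtype=np.float64)
    targets = np.asarray(targets)
    counts = np.bincount(targets, minlength=vocab_size)
    sums = np.bincount(targets, weights=losses, minlength=vocab_size)
    seen = counts > 0
    per_class_mean = sums[seen] / counts[seen]
    return losses.mean(), per_class_mean.mean()


if __name__ == "__main__":
    # Hypothetical toy data: token 0 is frequent and well fit, token 1 is rare
    # and poorly fit, token 2 never occurs.
    targets = np.array([0, 0, 0, 1])
    losses = np.array([1.0, 1.0, 1.0, 5.0])
    vocab_size = 3

    counts = np.bincount(targets, minlength=vocab_size)
    print("low-frequency ids:", np.flatnonzero(low_frequency_mask(counts, threshold=2)))

    avg, balanced = average_and_class_balanced_loss(losses, targets, vocab_size)
    print("average loss:", avg)
    print("class-balanced loss:", balanced)
```

Under this reading, the class-balanced loss reacts more strongly than the average loss whenever rare token ids are fit poorly, because each rare id counts as much as a frequent one. Note that ids with zero occurrences fall below any positive threshold and are excluded from the class-balanced mean.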

Theorems & Definitions (3)

  • Proposition 5.0
  • Proposition A.0
  • proof