Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law

Frederik Kunstner, Francis Bach

TL;DR

This work analyzes optimization dynamics for a linear bigram model under Zipf-like heavy-tailed word distributions. By deriving closed-form dynamics for gradient descent (GD) and sign descent (SD) under power-law frequencies and analyzing how they scale with the dimension $d$, it reveals distinct regimes driven by the tail exponent $\alpha$: GD requires an iteration count that grows with vocabulary size when $\alpha\le 1$, while SD needs only on the order of $\sqrt{d}$ iterations for Zipf data ($\alpha=1$). The authors extend scaling-law analyses to the $\alpha\le 1$ regime, beyond prior work that assumed $\alpha>1$, and validate their predictions with experiments on OpenWebText. The results highlight the potential of sign-descent-like methods and data-tail-aware strategies to mitigate ill-conditioning in large-vocabulary language-model training, and give guidance on how optimizers scale with vocabulary size in practice.

Abstract

Recent works have highlighted optimization difficulties faced by gradient descent in training the first and last layers of transformer-based language models, which are overcome by optimizers such as Adam. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data, where the frequency $\pi_k$ of the $k$th most frequent word is proportional to $1/k$, following Zipf's law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law $\pi_k \propto 1/k^\alpha$ parameterized by the exponent $\alpha > 0$. We derive optimization scaling laws for deterministic gradient descent and sign descent as a proxy for Adam as a function of the exponent $\alpha$. Existing theoretical investigations in scaling laws assume that the eigenvalues of the data decay as a power law with exponent $\alpha > 1$. This assumption effectively makes the problem "finite dimensional" as most of the loss comes from a few of the largest eigencomponents. In comparison, we show that the problem is more difficult when the data have heavier tails. The case $\alpha = 1$ as found in text data is "worst-case" for gradient descent, in that the number of iterations required to reach a small relative error scales almost linearly with dimension. While the performance of sign descent also depends on the dimension, for Zipf-distributed data the number of iterations scales only with the square root of the dimension, leading to a large improvement for large vocabularies.
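
The "finite dimensional" remark can be checked numerically. The sketch below is an illustration, not code from the paper: it builds normalized frequencies $\pi_k \propto 1/k^\alpha$ and reports how much probability mass the most frequent tokens carry as $d$ grows. The helper names `power_law_frequencies` and `head_mass`, and the cutoff of 100 tokens, are assumptions made for this example.

```python
import numpy as np


def power_law_frequencies(d, alpha):
    """Normalized frequencies pi_k proportional to 1/k^alpha for k = 1..d."""
    k = np.arange(1, d + 1, dtype=float)
    weights = k ** (-alpha)
    return weights / weights.sum()


def head_mass(d, alpha, top=100):
    """Fraction of the total mass carried by the `top` most frequent tokens."""
    return float(power_law_frequencies(d, alpha)[:top].sum())


if __name__ == "__main__":
    # Hypothetical vocabulary sizes, chosen only for illustration.
    for alpha in (0.5, 1.0, 2.0):
        masses = [head_mass(d, alpha) for d in (10**3, 10**4, 10**5)]
        print(f"alpha={alpha}: head mass for growing d = {masses}")
```

For $\alpha = 2$ the head mass stays close to 1 whatever the vocabulary size, whereas for $\alpha \le 1$ it keeps shrinking as $d$ grows, so no fixed number of components captures most of the mass.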

Paper Structure

This paper contains 10 sections, 3 theorems, 21 equations, 4 figures.

Key Result

Theorem 1.1

Consider the linear bigram model when the dimensionality $d$ is large. The number of iterations $t$ required to reach $\varepsilon$ relative accuracy with gradient descent scales as follows. By relative accuracy, we mean that $\mathcal{L}_d(t) - \mathcal{L}_d^* = \varepsilon(\mathcal{L}_d(0) - \mathcal{L}_d^*)$, where $\mathcal{L}_d(t)$ is the loss after $t$ steps and $\mathcal{L}_d^*$ is the min
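
The relative-accuracy criterion can be made concrete with a small simulation. This is a simplified sketch, not the paper's exact setup: it assumes a diagonal quadratic whose eigenvalues and initial per-component losses are both proportional to $\pi_k \propto 1/k^\alpha$, and it runs GD with step size $1/\pi_1$. Under those assumptions the relative error after $t$ steps is $\sum_k \pi_k (1 - \pi_k/\pi_1)^{2t}$. The function names are hypothetical, and because $t$ is an integer the criterion is applied as "at most $\varepsilon$" rather than with equality.

```python
import numpy as np


def gd_relative_error(d, alpha, t):
    """Relative error after t GD steps on the assumed diagonal quadratic."""
    k = np.arange(1, d + 1, dtype=float)
    pi = k ** (-alpha)
    pi /= pi.sum()
    # Step size 1/pi_1: component k contracts by (1 - pi_k/pi_1) per step.
    contraction = (1.0 - pi / pi[0]) ** (2 * t)
    # Initial losses are assumed proportional to pi and pi sums to 1,
    # so the weighted sum is already relative to the initial gap.
    return float(np.sum(pi * contraction))


def iterations_to_reach(d, alpha, eps, t_max=10**7):
    """Smallest t with relative error <= eps (doubling, then bisection)."""
    if gd_relative_error(d, alpha, 0) <= eps:
        return 0
    hi = 1
    while gd_relative_error(d, alpha, hi) > eps:
        hi *= 2
        if hi > t_max:
            raise RuntimeError("eps not reached within t_max iterations")
    lo = hi // 2  # the error at lo is still above eps
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if gd_relative_error(d, alpha, mid) <= eps:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for alpha in (1.0, 2.0):
        for d in (10**3, 10**4, 10**5):
            t = iterations_to_reach(d, alpha, eps=0.1)
            print(f"alpha={alpha}, d={d}: t={t}")
```

The bisection relies on the relative error being non-increasing in $t$, which holds here because every contraction factor lies in $[0, 1)$.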

Figures (4)

  • Figure 1: Gradient descent (GD) scales badly with vocabulary size when the data is Zipfian. Relative error on a linear bigram problem with squared loss trained with GD with vocabulary size $d$ when the word class frequencies follow $\pi_k \propto 1/k^\alpha$. For $\alpha \leq 1$ (left, middle) the performance degrades with vocabulary size, with worst scaling for Zipf-distributed data ($\alpha = 1$). When the frequencies have lighter tails ($\alpha = 2$, right) GD works well for all vocabulary sizes. Our objective is to derive scaling laws explaining this behavior.
  • Figure 2: Our scaling predicts the behavior of gradient descent and sign descent on real data. Left: the convergence of gradient descent (GD) and sign descent (SD) is close to our asymptotic prediction on a bigram model with $32$k tokens on OpenWebText, though not exact, owing to the finite dimension and our simplified model of the frequencies (the conditional-distribution assumption). Middle/Right: as $d$ grows, the number of iterations required to reach $\varepsilon$ relative error matches our predictions, showing that SD scales better with dimension for small $\varepsilon$. We show results on real data (dots) against the scaling of $c d^{1-\varepsilon}$ for GD and $c d^{1/2}$ for SD (dashes), where $c$ is fit to the data.
  • Figure 3: Token frequencies and conditional frequencies approximately follow Zipf's law. The conditional-distribution assumption provides a reasonable approximation of the frequencies (left) and conditional frequencies (right) on text data, computed on OpenWebText for a vocabulary of $10^4$ words. For a word $k$, the right plot shows the median and quantiles of the distribution $\pi_{k\,|\, j}$ for $j \in [d]$.
  • Figure 4: Scaling of gradient descent on power-law data with exponent $\alpha$ (the scaling theorem for gradient descent). The dynamics of gradient descent on the linear bigram model with data satisfying the conditional-distribution assumption converge to our scaling law as $d$ grows. Achieving a relative error $\varepsilon$ requires scaling the iteration budget $T$ with $d^\alpha$ for $\alpha < 1$, $T$ with $d^{1-\varepsilon}$ for $\alpha = 1$, and no scaling for $\alpha > 1$.
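
The reference curves in Figure 2 have the form $c\,d^{p}$ with a known exponent $p$ and a constant $c$ fit to the data. The captions do not say how $c$ is fit; one plausible choice, assumed here, is least squares in log space, which for a fixed exponent reduces to averaging $\log T - p \log d$. The function name `fit_constant` and the numbers in the usage example are synthetic and purely illustrative.

```python
import numpy as np


def fit_constant(dims, iterations, exponent):
    """Least-squares fit of c in log T = log c + exponent * log d."""
    dims = np.asarray(dims, dtype=float)
    iterations = np.asarray(iterations, dtype=float)
    # With the exponent held fixed, the optimal log c is the mean residual.
    log_c = np.mean(np.log(iterations) - exponent * np.log(dims))
    return float(np.exp(log_c))


if __name__ == "__main__":
    # Synthetic data generated from T = 3 * sqrt(d); not measurements.
    dims = np.array([1e3, 1e4, 1e5])
    iterations = 3.0 * np.sqrt(dims)
    print(fit_constant(dims, iterations, exponent=0.5))
```

Fitting in log space weights every vocabulary size equally on a log-log plot, which matches how such scaling curves are usually compared by eye.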

Theorems & Definitions (5)

  • Theorem 1.1: Informal
  • Proposition 2.2
  • proof
  • Theorem 3.0: Scaling for gradient descent
  • proof