Convergence and Divergence of Language Models under Different Random Seeds

Finlay Fehlauer, Kyle Mahowald, Tiago Pimentel

TL;DR

This work analyzes how language models trained with different random seeds converge or diverge, defining convergence as the negative expected KL divergence between seed-specific distributions. It uncovers a robust four-phase convergence pattern that unfolds during training, with larger models reconverging faster and smaller models often failing to reach a shared solution, even as cross-entropy improves. The study also shows that token frequency and linguistic class (function vs. content words) shape convergence: frequent tokens and function words stabilize more reliably, while infrequent tokens and content words remain volatile. Conditional convergence analyses and downstream-task replication (BLiMP, MultiBERT) suggest these dynamics generalize beyond a single task. Overall, the findings illuminate stability and reproducibility considerations in LM training, implying a minimum effective model size and highlighting how data properties and tokenization influence convergence trajectories.

Abstract

In this paper, we investigate the convergence of language models (LMs) trained under different random seeds, measuring convergence as the expected per-token Kullback--Leibler (KL) divergence across seeds. By comparing LM convergence as a function of model size and training checkpoint, we identify a four-phase convergence pattern: (i) an initial uniform phase, (ii) a sharp-convergence phase, (iii) a sharp-divergence phase, and (iv) a slow-reconvergence phase. Further, we observe that larger models reconverge faster in later training stages, while smaller models never actually reconverge; these results suggest that a certain model size may be necessary to learn stable distributions. Restricting our analysis to specific token frequencies or part-of-speech (PoS) tags further reveals that convergence is uneven across linguistic categories: frequent tokens and function words converge faster and more reliably than their counterparts (infrequent tokens and content words). Overall, our findings highlight factors that influence the stability of the learned distributions in model training.
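To make the convergence measure concrete, the following is a minimal sketch of how an expected per-token KL divergence across seeds could be estimated from next-token distributions. It is not the authors' implementation: the function names (`kl_divergence`, `expected_convergence`), the array layout, and the choice to average over all ordered pairs of distinct seeds are assumptions made for illustration. Following the TL;DR, the sketch reports the negated KL, so values closer to zero mean the seeds agree more.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) along the last axis (the vocabulary).

    Probabilities are clipped to `eps` to avoid log(0); this is a
    numerical convenience assumed here, not part of the paper.
    """
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


def expected_convergence(probs):
    """Negative mean KL over ordered seed pairs and contexts.

    probs: array of shape (num_seeds, num_contexts, vocab_size), where
    probs[i, c] is seed i's next-token distribution for context c.
    The pairwise averaging scheme is an assumption of this sketch.
    """
    probs = np.asarray(probs, dtype=float)
    num_seeds = probs.shape[0]
    if num_seeds < 2:
        raise ValueError("need at least two seeds to compare")

    pair_kls = []
    for i in range(num_seeds):
        for j in range(num_seeds):
            if i != j:
                # One KL value per context for this ordered seed pair.
                pair_kls.append(kl_divergence(probs[i], probs[j]))
    return -float(np.mean(pair_kls))


if __name__ == "__main__":
    # Synthetic stand-in for model outputs: 3 seeds, 5 contexts, 10 tokens.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(3, 5, 10))
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    print("estimated E[conv]:", expected_convergence(probs))
```

Evaluating this quantity at successive checkpoints, once per model size, would give curves of the kind the abstract describes. Restricting the contexts to those whose next token falls in a given frequency bin or PoS class would correspond to the conditional analyses.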

Paper Structure

This paper contains 33 sections, 7 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Estimated $\mathbb{E}[\mathtt{conv}]$ across training steps ($x$-axis). Shaded areas represent $1\sigma$ confidence intervals.
  • Figure 2: Estimated $\mathbb{E}[\mathtt{conv}]$ across training steps: (left) on the Pythia model suite on BLiMP with $1\bar{\sigma}$ confidence intervals, (right) on the MultiBERT model suite on masked language modelling with $1\sigma$ confidence intervals.
  • Figure 3: $\mathbb{E}_{\mathcal{S}_t}[\mathtt{conv}]$ of selected models with $1\sigma$ confidence intervals. Conditioning property: (left) frequency; (center) parts of speech; (right) final surprisal.
  • Figure 4: $\mathtt{conv}(\mathbf{s}_{<t})$ across training, with shaded areas representing its standard deviation across contexts $\mathbf{s}_{<t}$.
  • Figure 5: Illustration of the mapping of part of speech tags to tokens.
  • ...and 6 more figures

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3