
Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale

James A. Michaelov, Roger P. Levy, Benjamin K. Bergen

TL;DR

The paper investigates whether the behavior of autoregressive language models during pretraining follows a consistent trajectory across architecture, training data, and scale. An analysis of 1,418 checkpoints across Parc-Pythia, Parc-Mamba, Parc-RWKV, and Open-GPT2 using the NaWoCo dataset shows that three simple heuristics (unigram frequency, $n$-gram probability, and contextual semantic similarity) explain up to $98\%$ of word-level log-probability variance, with consistent behavioral phases emerging during training. The results hold across Transformer, state-space, and recurrent architectures, and across The Pile and OpenWebText training data, suggesting common learning dynamics regardless of model details. This points to the autoregressive objective as a dominant factor shaping learning: semantic similarity contributes early, and reliance on higher-order $n$-grams develops later. The three heuristics thus offer a simplified lens for understanding LM development and downstream capabilities.

Abstract

We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the $n$-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' $n$-gram probabilities for increasing $n$ over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
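The "variance explained" claim can be made concrete with a small sketch: regress per-word language model log-probability on the three heuristics and report $R^2$. The code below is an illustration only, not the authors' pipeline. The data are synthetic stand-ins generated on the spot, and the names `fit_ols`, `r_squared`, and `TRUE_BETA`, along with the coefficient values, are assumptions made for the example.

```python
import random

def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, ..., bk] via the normal equations."""
    X = [[1.0, *row] for row in rows]  # prepend an intercept column
    k = len(X[0])
    # Augmented system [X^T X | X^T y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(rows, y, beta):
    """Proportion of the variance in y explained by the fitted linear model."""
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

# SYNTHETIC data, not results from the paper. Each row holds hypothetical
# standardized values of (unigram log-prob, n-gram log-prob, semantic similarity).
rng = random.Random(0)
TRUE_BETA = [-4.0, 0.2, 0.7, 0.3]  # assumed intercept and weights, illustration only
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2000)]
y = [TRUE_BETA[0]
     + sum(b * v for b, v in zip(TRUE_BETA[1:], row))
     + rng.gauss(0, 0.1)  # residual the heuristics do not capture
     for row in rows]

beta = fit_ols(rows, y)
r2 = r_squared(rows, y, beta)
print("coefficients:", [round(b, 3) for b in beta])
print("R^2:", round(r2, 3))
```

Repeating such a fit at every checkpoint would give one coefficient trajectory per heuristic and one $R^2$ curve over training, which is the kind of quantity the abstract summarizes.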

Paper Structure

This paper contains 35 sections, 3 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: Pearson correlation coefficient $r$ between language model log-probability and heuristic metrics ($n$-gram log-probability and word embedding cosine similarity). We show the mean values for all models across seeds and their 95% confidence intervals.
  • Figure 2: (A) Regression coefficients of the three heuristics over the course of training under different conditions, specifically, whether the $n$-gram data is the same as that on which the language model was trained (matched) or not (unmatched), and whether the SGPT-weighted contextual semantic similarity metric is calculated using Common-Crawl-based or Wikipedia-based fastText word vectors. (B) Proportion of the variance in language model log-probability explained by the regressions in panel (A). We also report the $R^2$ values of the same regressions' predictions on the validation set.
  • Figure 3: Spearman correlation coefficient $\rho$ between language model log-probability and heuristic metrics ($n$-gram log-probability and word embedding cosine similarity). We show the mean values for all models across seeds and their 95% confidence intervals.
  • Figure 4: Seed-level Pearson correlation coefficient $r$ between language model log-probability and heuristic metrics ($n$-gram log-probability and word embedding cosine similarity).
  • Figure 5: Spearman correlation coefficient $\rho$ between language model log-probability and heuristic metrics ($n$-gram log-probability and word embedding cosine similarity) at the seed level.
  • ...and 14 more figures
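
The figure captions above report both Pearson's $r$ and Spearman's $\rho$ between language model log-probability and each heuristic. As a hedged illustration of how the two differ, the sketch below computes both on made-up numbers. The function names and values are assumptions for the example, not the paper's code or data. Spearman's $\rho$ is Pearson's $r$ applied to ranks, so it captures any monotonic relationship, while Pearson's $r$ measures linear association only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up values for illustration only (not data from the paper).
lm_logprob = [-9.1, -7.4, -5.0, -3.2, -1.1]
heuristic = [v ** 3 for v in lm_logprob]  # monotonic but nonlinear in lm_logprob

print("Pearson r:   ", pearson_r(lm_logprob, heuristic))
print("Spearman rho:", spearman_rho(lm_logprob, heuristic))
```

Because the made-up heuristic is a strictly increasing function of the log-probabilities, the rank correlation is perfect while the linear correlation is not.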