Evolution of Concepts in Language Model Pre-Training

Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu

TL;DR

This work investigates how language models acquire capabilities during pre-training, using crosscoders, a sparse dictionary learning method, to track linear features across training snapshots. By aligning features from all snapshots in a shared feature space, the authors identify a two-phase learning dynamic: an initial statistical learning phase that captures coarse unigram and bigram patterns, followed by a feature learning phase in which more complex concepts emerge and reorganize the activation space. Attribution-based circuit tracing links feature evolution causally to downstream performance. Decoder norms serve as a proxy for feature strength, and emergent features generally persist and go on to form complex, context-sensitive patterns. The study offers a framework for observing fine-grained learning dynamics, suggests a universal turning point around step $1{,}000$, and notes limitations in generalizability and task complexity, while providing methods and code to extend mechanistic interpretability to pre-training.

Abstract

Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.
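
The crosscoder idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly described formulation, in which the encoder sums contributions from every snapshot into one shared set of sparse features and each snapshot has its own decoder. The function names, the ReLU activation, and the toy weights are illustrative assumptions, not the authors' implementation. The per-snapshot decoder norm of a feature is the quantity the summary describes as a proxy for feature strength.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def crosscoder_forward(acts, W_enc, b_enc, W_dec, b_dec):
    """Hypothetical crosscoder forward pass.

    acts[i]  : activation vector from snapshot i (length d_model)
    W_enc[i] : encoder for snapshot i, shape n_features x d_model
    W_dec[i] : decoder for snapshot i, shape d_model x n_features
    Returns the shared feature activations and one reconstruction per snapshot.
    """
    pre = list(b_enc)
    for a_i, W_i in zip(acts, W_enc):
        # Every snapshot contributes to the same pre-activation vector.
        pre = [p + c for p, c in zip(pre, matvec(W_i, a_i))]
    f = [max(0.0, p) for p in pre]  # ReLU is an assumption made for this sketch
    recons = [
        [r + b for r, b in zip(matvec(W_i, f), b_i)]
        for W_i, b_i in zip(W_dec, b_dec)
    ]
    return f, recons

def decoder_norms(W_dec):
    """norms[i][k] = L2 norm of feature k's decoder vector at snapshot i."""
    norms = []
    for W_i in W_dec:
        n_features = len(W_i[0])
        norms.append(
            [math.sqrt(sum(row[k] ** 2 for row in W_i)) for k in range(n_features)]
        )
    return norms

# Toy setup: 2 snapshots, d_model = 2, n_features = 2 (all values made up).
acts = [[1.0, 0.0], [0.0, 2.0]]
W_enc = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [-1.0, 0.0]]]
b_enc = [0.0, -0.5]
W_dec = [[[3.0, 0.0], [4.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]
b_dec = [[0.0, 0.0], [1.0, 1.0]]

f, recons = crosscoder_forward(acts, W_enc, b_enc, W_dec, b_dec)
norms = decoder_norms(W_dec)
```

In this toy setup, feature 0 has a nonzero decoder norm only at the first snapshot and feature 1 only at the second. Under this reading, tracking a feature's norm across snapshots indicates when it is present in the model.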

Paper Structure

This paper contains 55 sections, 12 equations, 35 figures, 1 table.

Figures (35)

  • Figure 1: Overview of our method. The crosscoder is trained to decompose activations of multiple pre-training snapshots (left) into sparse features (right).
  • Figure 2: Explained variances versus L0 norms of our crosscoders.
  • Figure 3: Comparison between crosscoders and per-snapshot SAEs. (a) The explained variance of crosscoders versus SAEs at each snapshot. (b) The L0 norm of crosscoders versus SAEs at each snapshot. (c) The Pareto frontier comparison of crosscoders and SAEs trained on the final snapshot.
  • Figure 4: Overview of cross-snapshot feature decoder norm evolution. Features are extracted by a 98,304-feature crosscoder on Pythia-160M (top) and a 32,768-feature crosscoder on Pythia-6.9B (bottom).
  • Figure 5: Statistics of emergent features in a 98,304-feature crosscoder on Pythia-160M. (a) Distribution of peak emergence times. (b) Distribution of feature lifetime. (c) Mean projection of each feature's decoder vector of snapshot $\theta_i$ onto its decoder vector of snapshot $\theta_j$.
  • ...and 30 more figures
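
The quantities named in the captions of Figures 2, 3, and 5(c) can be sketched with their standard definitions. The paper's exact normalization is not given in this overview, so the formulas below are assumptions: explained variance as one minus the ratio of residual to total sum of squares, L0 as the mean count of nonzero features per input, and projection as the scalar projection of one decoder vector onto another. All function names are hypothetical.

```python
import math

def explained_variance(xs, x_hats):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    n, d = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(d)]
    rss = sum((x[j] - xh[j]) ** 2 for x, xh in zip(xs, x_hats) for j in range(d))
    tss = sum((x[j] - mean[j]) ** 2 for x in xs for j in range(d))
    return 1.0 - rss / tss

def mean_l0(feature_acts):
    """Average number of nonzero feature activations per input."""
    counts = [sum(1 for v in row if v != 0.0) for row in feature_acts]
    return sum(counts) / len(feature_acts)

def scalar_projection(d_i, d_j):
    """Length of d_i along the direction of d_j: (d_i . d_j) / ||d_j||."""
    dot = sum(a * b for a, b in zip(d_i, d_j))
    return dot / math.sqrt(sum(b * b for b in d_j))

# Toy values chosen so the results can be checked by hand.
ev = explained_variance([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 3.0]])
l0 = mean_l0([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]])
proj = scalar_projection([3.0, 4.0], [1.0, 0.0])
```

With these toy inputs, the residual sum of squares is 1 and the total is 4, so the explained variance is 0.75. The two inputs activate one and two features, giving a mean L0 of 1.5, and the projection of (3, 4) onto the first axis is 3.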