Evolution of Concepts in Language Model Pre-Training

Xuyang Ge; Wentao Shu; Jiaxing Wu; Yunhua Zhou; Zhengfu He; Xipeng Qiu

TL;DR

This work investigates how language models acquire capabilities during pre-training, using crosscoders, a sparse dictionary learning approach, to track linear features across training snapshots. By aligning features from all snapshots in a unified representation, the authors uncover a two-phase learning dynamic: an initial statistical learning phase that captures coarse unigram and bigram patterns, followed by a feature learning phase in which more complex concepts emerge and reorganize the activation space. Attribution-based circuit tracing reveals causal connections between feature evolution and downstream performance, and decoder norms serve as proxies for feature strength; emergent features generally persist and form complex, context-sensitive patterns. The study provides a principled framework for observing fine-grained learning dynamics, suggests a turning point around step $1{,}000$, and notes limitations related to generalizability and task complexity. It also offers reproducible methods and code to extend mechanistic interpretability to pre-training regimes.

Abstract

Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track the evolution of linear, interpretable features across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on the two-stage learning process of Transformers, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility of tracking fine-grained representation progress throughout language model training.
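
The idea of reading feature strength off decoder norms can be illustrated with a small sketch. It assumes a trained crosscoder that stores one decoder vector per feature per snapshot; the nested-list layout, the function names `decoder_norms` and `emergence_index`, the half-of-peak threshold, and the toy numbers are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def decoder_norms(decoders):
    """Return norms[f][s], the L2 norm of feature f's decoder at snapshot s.

    Assumed layout: decoders[s][f] is the decoder vector (a list of floats)
    of feature f in snapshot s.
    """
    n_features = len(decoders[0])
    return [
        [math.sqrt(sum(w * w for w in snapshot[f])) for snapshot in decoders]
        for f in range(n_features)
    ]


def emergence_index(norms, frac=0.5):
    """Index of the first snapshot whose norm reaches `frac` of the peak norm.

    The threshold rule is a hypothetical criterion for "the feature has
    formed". Returns None when the feature never activates.
    """
    peak = max(norms)
    if peak == 0.0:
        return None
    for i, value in enumerate(norms):
        if value >= frac * peak:
            return i
    return None


# Toy data: three snapshots, two features, two-dimensional decoders.
steps = [0, 1000, 10000]
decoders = [
    [[0.0, 0.0], [3.0, 4.0]],  # snapshot at step 0
    [[3.0, 4.0], [3.0, 4.0]],  # snapshot at step 1000
    [[6.0, 8.0], [3.0, 4.0]],  # snapshot at step 10000
]

norms = decoder_norms(decoders)
for f, trajectory in enumerate(norms):
    idx = emergence_index(trajectory)
    print(f"feature {f}: norms={trajectory}, emerges at step {steps[idx]}")
```

In this toy setup, feature 0 is absent at initialization and grows over training, while feature 1 is present with constant strength in every snapshot. Plotting such trajectories for all features is one way to look for a shared point at which most features begin to form.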
