The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models

Shashata Sawmya, Micah Adler, Nir Shavit

TL;DR

This work investigates how interpretable, categorical features emerge in large language models as a function of training time, transformer depth, and model scale. Using sparse autoencoders to induce and interpret latent features, together with the EyeSee/AutoInterp pipeline, the authors map when and where semantic concepts appear in Pythia-12B across 25 training checkpoints, and across multiple model sizes. They report clear temporal and scale thresholds for feature emergence and reveal a surprising reactivation of early-layer features in later layers, challenging simple hierarchical assumptions about representation dynamics. The findings provide mechanistic insight into how knowledge representations crystallize in LLMs and offer scalable interpretability diagnostics for future model development and evaluation.

Abstract

This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and varying model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. Results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, spatial analysis reveals unexpected semantic reactivation, with early-layer features re-emerging at later layers, challenging standard assumptions about representational dynamics in transformer models.

Paper Structure

This paper contains 30 sections, 9 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Axes of our emergent-knowledge probe. We track how interpretable, categorical features surface in a language model over time (training checkpoints), space (depth across transformer blocks), and scale (parameter count), progressing from sparse or absent concepts (grey bulbs) to rich representations (yellow bulbs).
  • Figure 2: Hyperparameter sweep for sparse-autoencoder interpretability. Left: mean F1-score as the activation budget $k$ varies with width fixed at $h=256$; right: mean F1-score as the latent width $h$ varies with $k=1$. The optimal setting for our data is $k=1,\,h=512$, which maximises mean F1.
  • Figure 3: Activation patterns of categorical concepts in a 12B-parameter language model across training checkpoints. The left panel illustrates the global activation trajectory, while panels on the right display domain-specific emergence patterns, highlighting distinct activation timings for various knowledge concepts.
  • Figure 4: Cosine similarity (left) reveals three macro blocks (embedding, processing core, output), while SAE probes (right) show that feature directions are highly local in depth, with a striking echo between the first and last layers. This indicates that the network temporarily hides early lexical axes during computation before restoring them for final prediction.
  • Figure 5: Concept-activation saturation with model scale. Left: percentage of all concepts or features that are activated for each Pythia checkpoint from 14 M to 12 B parameters (log scale). A single inflection between the 160 M and 410 M models raises activation from $<\!5\%$ to $\approx95\%$, after which the curve plateaus. Right: per-domain activation profiles show similar critical points for most areas, while Business concepts rise more gradually.
  • ...and 2 more figures
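
To make the two hyperparameters in the Figure 2 caption concrete, here is a minimal sketch of a sparse-autoencoder forward pass in which $h$ is the latent width and $k$ is the number of latent units allowed to stay active per input. It assumes a top-k activation rule, which is one common way to enforce an activation budget; the summary above does not specify the authors' exact architecture or training procedure. The function name, the input dimension `d`, and the random weights are all hypothetical and serve only as illustration.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=1):
    """Encode x, keep only the k largest latent activations per row, decode.

    Assumption: a top-k sparsity rule stands in for the "activation budget k"
    mentioned in the Figure 2 caption; the paper's actual SAE may differ.
    """
    # ReLU pre-activations, shape (n_samples, h)
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Column indices of the k largest activations in each row
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    # Sparse code: zero everywhere except at the top-k positions
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    # Linear reconstruction of the input from the sparse code
    x_hat = z @ W_dec + b_dec
    return z, x_hat


rng = np.random.default_rng(0)
d = 64            # hypothetical activation dimension, not from the paper
h, k = 512, 1     # the setting the Figure 2 caption reports as optimal
x = rng.normal(size=(8, d))                      # stand-in for model activations
W_enc = rng.normal(scale=d ** -0.5, size=(d, h))
W_dec = W_enc.T.copy()                           # tied initialisation, an assumption
b_enc = np.zeros(h)
b_dec = np.zeros(d)

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=k)
print(z.shape, x_hat.shape, np.count_nonzero(z, axis=1).max())
```

With $k=1$, each input is assigned to at most one of the $h$ latent units, so every latent behaves like a single candidate category. That is a plausible reading of why a small budget suits categorical features, though the caption itself only reports the F1 outcome.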