Evidence of Phase Transitions in Small Transformer-Based Language Models

Noah Hong; Tao Hong

TL;DR

The paper investigates phase-transition–like reorganizations in a small transformer trained on a character‑level corpus, proposing Poisson‑centered diagnostics to detect abrupt internal reorganizations directly in linear training space. Using a 3.6M‑parameter GPT‑style model trained on Tiny Shakespeare, the authors track dispersion, KL divergence, word length, and vocabulary dynamics to reveal synchronized discontinuities that are invisible to standard loss curves. They show an early lexical reorganization where fragments cohere into multi‑character words, with a temporary degradation suggesting barrier crossing between memorization and generalization. The work provides a mechanistic, scale‑invariant perspective on emergent linguistic structure and offers practical diagnostics for understanding nonlinear learning dynamics in language models.

Abstract

Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, this transition is not apparent in standard loss or validation curves, but becomes visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase-transition behaviors.
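To make the Poisson-centered probes concrete, the following is a minimal sketch of how such diagnostics might be computed on a generated text sample at one training checkpoint. It is not the authors' implementation: the function names (`word_lengths`, `fano_factor`, `kl_to_poisson`), the whitespace tokenization, and the choice of the variance-to-mean ratio as the dispersion measure are all assumptions made for illustration. A ratio below 1 would indicate sub-Poisson (more regular than random) statistics, and the KL term measures how far the empirical word-length histogram sits from a Poisson distribution with the same mean.

```python
import math
from collections import Counter


def word_lengths(text):
    """Split on whitespace (an assumed tokenization) and return word lengths."""
    return [len(w) for w in text.split()]


def fano_factor(counts):
    """Variance-to-mean ratio: 1 for Poisson, <1 sub-Poisson, >1 super-Poisson."""
    if not counts:
        raise ValueError("need at least one observation")
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        raise ValueError("dispersion is undefined when the mean is zero")
    var = sum((c - mean) ** 2 for c in counts) / n  # population variance
    return var / mean


def kl_to_poisson(counts):
    """KL divergence (in nats) from the empirical histogram to Poisson(mean)."""
    if not counts:
        raise ValueError("need at least one observation")
    n = len(counts)
    lam = sum(counts) / n
    kl = 0.0
    for k, c in Counter(counts).items():
        p = c / n  # empirical probability of value k
        q = math.exp(-lam) * lam ** k / math.factorial(k)  # Poisson pmf at k
        kl += p * math.log(p / q)
    return kl


# Hypothetical sample standing in for text generated at one checkpoint.
sample = "to be or not to be that is the question"
lengths = word_lengths(sample)
print(fano_factor(lengths), kl_to_poisson(lengths))
```

Evaluating these two quantities at every checkpoint and plotting them against the raw (linear) step count is one way a synchronized discontinuity could be looked for alongside the loss curve.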
