Berezinskii--Kosterlitz--Thouless transition in a context-sensitive random language model
Yuma Toji, Jun Takahashi, Vwani Roychowdhury, Hideyuki Miyahara
TL;DR
This work asks whether language-generating systems can exhibit genuine phase-transition behavior. It introduces a context-sensitive random language model, a probabilistic model in the class of context-sensitive grammars (CSGs), that combines growth rules with long-range, context-dependent interactions inspired by the one-dimensional long-range Potts model. Ordering is analyzed via an order parameter $M$, the susceptibility, and the Binder parameter. The authors demonstrate a Berezinskii--Kosterlitz--Thouless (BKT)-type transition with an extended critical phase, evidenced by power-law correlations and by finite-size scaling that yields estimates of $T_c$, $\nu$, and $\gamma$ in representative parameter regimes; whether the transition occurs depends on the growth rule and the alphabet size. The findings suggest that robust scaling laws in natural languages and modern language models may arise from intrinsic grammatical and long-range coherence mechanisms rather than from fine-tuning, offering a thermodynamic lens on language structure and emergent capabilities. This minimal, analyzable framework opens avenues for connecting linguistic growth, attention-like long-range interactions, and critical phenomena in NLP systems.
Abstract
Several power-law critical properties involving different statistics in natural languages -- reminiscent of scaling properties of physical systems at or near phase transitions -- have been documented for decades. The recent rise of large language models has added further evidence and excitement by revealing intriguing similarities with notions in physics such as scaling laws and emergent abilities. However, specific instances of classes of generative language models that exhibit phase transitions, as understood by the statistical physics community, are lacking. In this work, inspired by the one-dimensional Potts model in statistical physics, we construct a simple probabilistic language model that falls under the class of context-sensitive grammars, which we call the context-sensitive random language model, and numerically demonstrate an unambiguous phase transition in the framework of a natural language model. We explicitly show that a precisely defined order parameter, which captures symbol frequency biases in the sentences generated by the language model, changes from strictly zero to a strictly nonzero value in the infinite-length limit of sentences, implying a mathematical singularity that arises as the parameter of the stochastic language model is tuned. Furthermore, we identify the phase transition as a variant of the Berezinskii--Kosterlitz--Thouless (BKT) transition, which is known to exhibit critical properties not only at the transition point but also throughout an entire phase. This finding raises the possibility that critical properties in natural languages may require neither careful fine-tuning nor self-organized criticality, but may instead be generically explained by an underlying connection between language structures and BKT phases.
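To make the notion of an order parameter built from symbol frequency biases concrete, the following is a minimal Python sketch. It assumes the standard Potts-model conventions: a magnetization based on the most frequent symbol, a susceptibility from its variance, and the fourth-order Binder parameter. These definitions, the function names, and the alphabet-size argument `q` are illustrative assumptions; the paper's exact definition of $M$ may differ.

```python
from collections import Counter


def order_parameter(sentence, q):
    """Potts-style magnetization (assumed convention, not taken from the paper).

    Returns 0 when all q symbols are equally frequent and 1 when a single
    symbol fills the whole sentence.
    """
    if not sentence:
        raise ValueError("sentence must be non-empty")
    max_fraction = max(Counter(sentence).values()) / len(sentence)
    return (q * max_fraction - 1) / (q - 1)


def moments(samples, q):
    """First, second, and fourth moments of the order parameter over samples."""
    ms = [order_parameter(s, q) for s in samples]
    n = len(ms)
    m1 = sum(ms) / n
    m2 = sum(m ** 2 for m in ms) / n
    m4 = sum(m ** 4 for m in ms) / n
    return m1, m2, m4


def susceptibility(samples, q):
    """Variance of the order parameter scaled by sentence length.

    Assumes every sample has the same length as the first one.
    """
    length = len(samples[0])
    m1, m2, _ = moments(samples, q)
    return length * (m2 - m1 ** 2)


def binder_parameter(samples, q):
    """Fourth-order Binder parameter U = 1 - <m^4> / (3 <m^2>^2)."""
    _, m2, m4 = moments(samples, q)
    if m2 == 0:
        raise ValueError("Binder parameter is undefined when <m^2> = 0")
    return 1 - m4 / (3 * m2 ** 2)
```

For example, `order_parameter("aabc", 3)` gives 0.25, since the most frequent symbol occupies half of the sentence while an unbiased three-symbol sentence would give one third. In a finite-size scaling study, one would typically evaluate these quantities on sentences sampled at several lengths and temperatures and compare the resulting curves.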
