Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text
Kai Kugler
TL;DR
This study asks whether Martin's Law, which links word frequency to polysemy, holds in text generated by large language models, and how it evolves during training. It introduces a data-driven, clustering-based polysemy measure applied to contextualized embeddings from four Pythia models across 30 checkpoints, using DBSCAN to identify senses and Spearman correlations to quantify relationships. The findings reveal a non-monotonic trajectory: emergence near $10^2$ steps, peak alignment around $10^4$ steps with $\rho > 0.6$, and degradation thereafter, with smaller models showing semantic collapse while larger models degrade more gracefully. A stable frequency–specificity tradeoff around $\rho \approx -0.3$ suggests a flatter semantic space than in human language. The work identifies a capacity threshold near 200M parameters and proposes a framework for evaluating emergent linguistic structure in LLMs, highlighting the importance of checkpoint timing for realistic linguistic behavior and guiding future cross-model and cross-linguistic investigations.
Abstract
We present the first systematic investigation of Martin's Law, the empirical relationship between word frequency and polysemy, in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint $10^2$, reaches peak correlation ($r > 0.6$) at checkpoint $10^4$, then degrades by checkpoint $10^5$. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable ($r \approx -0.3$) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text does not increase monotonically with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
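The measurement described above can be illustrated with a minimal, dependency-free sketch: count DBSCAN clusters among a word's contextualized embeddings as a proxy for its number of senses, then compute the Spearman rank correlation between word frequency and sense count. This is not the paper's implementation. The function names, the cosine distance, the values `eps=0.3` and `min_samples=2`, and the 2-D toy vectors are all assumptions chosen for illustration; in practice one would more likely use a library implementation such as scikit-learn's `DBSCAN` and SciPy's `spearmanr` on real model embeddings.

```python
import math
from collections import deque


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def count_senses(embeddings, eps=0.3, min_samples=2):
    """Number of DBSCAN clusters among one word's embeddings (noise excluded).

    eps and min_samples are illustrative assumptions, not the paper's settings.
    A point counts itself among its neighbors.
    """
    n = len(embeddings)
    neighbors = [
        [j for j in range(n)
         if cosine_distance(embeddings[i], embeddings[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = unvisited, -1 = noise, k >= 0 = cluster id
    n_clusters = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = n_clusters
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = n_clusters  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = n_clusters
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand the cluster
        n_clusters += 1
    return n_clusters


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: word -> (corpus frequency, contextual embeddings).
toy = {
    "run": (50, [(1, 0), (0.99, 0.05), (0, 1), (0.05, 0.99), (-1, 0), (-0.99, 0.05)]),
    "bank": (20, [(1, 0), (0.98, 0.1), (0, 1), (0.1, 0.98)]),
    "oxide": (3, [(1, 0), (0.99, 0.1)]),
}
frequencies = [freq for freq, _ in toy.values()]
senses = [count_senses(vectors) for _, vectors in toy.values()]
rho = spearman_rho(frequencies, senses)
print(senses, rho)
```

In this toy setting the more frequent words have more clusters, so the correlation is positive, which is the direction Martin's Law predicts. Repeating the computation on text generated at each training checkpoint would give the kind of trajectory the abstract describes.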
