Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

Kai Kugler

TL;DR

This study asks whether Martin's Law, which links word frequency to polysemy, holds in text generated by large language models, and how compliance with it evolves during training. It introduces a data-driven, clustering-based polysemy measure applied to contextualized embeddings from four Pythia models across 30 checkpoints, using DBSCAN to identify senses and Spearman correlations to quantify relationships. The findings reveal a non-monotonic trajectory: the law emerges near $10^2$ steps, alignment peaks around $10^4$ steps with $\rho > 0.6$, and it degrades thereafter. Smaller models show semantic collapse, while larger models degrade more gracefully. A stable frequency–specificity tradeoff around $\rho \approx -0.3$ suggests a flatter semantic space than human language. The work identifies a capacity threshold near 200M parameters and proposes a framework for evaluating emergent linguistic structure in LLMs. It highlights the importance of checkpoint timing for realistic linguistic behavior and points to future cross-model and cross-linguistic investigations.

Abstract

We present the first systematic investigation of Martin's Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint $10^2$, reaches peak correlation ($r > 0.6$) at checkpoint $10^4$, then degrades by checkpoint $10^5$. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable ($r \approx -0.3$) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
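
The measurement pipeline described above (count DBSCAN clusters per word as a proxy for senses, then rank-correlate sense counts with word frequency) can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation: the 2-D toy "embeddings", the word list and frequencies, and the `eps` and `min_samples` values are all assumptions made for illustration, and the helper names are hypothetical.

```python
import math
from collections import deque


def dbscan_cluster_count(points, eps, min_samples):
    """Count DBSCAN clusters among points (Euclidean distance); noise is ignored."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual DBSCAN definition.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = unvisited, -1 = noise, k >= 0 = cluster id
    clusters = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        labels[i] = clusters
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = clusters  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = clusters
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
        clusters += 1
    return clusters


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: word -> (frequency, hand-made 2-D "embeddings").
# Real contextualized embeddings are high-dimensional model activations.
toy = {
    "bank": (900, [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1),
                   (9, 0), (9.1, 0), (9, 0.1)]),
    "run": (500, [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]),
    "zinc": (20, [(0, 0), (0.1, 0), (0, 0.1)]),
}

EPS, MIN_SAMPLES = 0.5, 3  # assumed values, not taken from the paper
freqs = [freq for freq, _ in toy.values()]
senses = [dbscan_cluster_count(pts, EPS, MIN_SAMPLES) for _, pts in toy.values()]
rho = spearman(freqs, senses)
print(dict(zip(toy, senses)), rho)
```

The toy data is built so that the more frequent words have more well-separated clusters, so the correlation is positive by construction. With real embeddings the cluster counts depend strongly on `eps`, `min_samples`, and the distance metric, and a library implementation such as scikit-learn's `DBSCAN` with SciPy's `spearmanr` would be the more practical choice.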

Paper Structure

This paper contains 25 sections, 1 figure.

Figures (1)

  • Figure 1: Semantic emergence across training. Top-left: Martin's Law (frequency-polysemy correlation) shows a non-monotonic trajectory with a peak at $\sim 10^4$ steps. Top-right: Mean polysemy (semantic differentiation) collapses in small models at late checkpoints. Bottom-left: The frequency-specificity tradeoff remains stable across training. Bottom-right: Polysemous word count diverges by model scale, with catastrophic collapse in small models.