Semantic Chunking and the Entropy of Natural Language

Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks

TL;DR

The paper develops a first-principles framework linking the multiscale semantic structure of text to its entropy rate. By pairing a recursive semantic chunking procedure with a tractable random $K$-ary tree ensemble, it derives a corpus-level entropy rate $h_K$ that matches LLM-based estimates $h_{\mathrm{LLM}}$ across diverse genres. It shows that level-wise chunk-size distributions become lognormal in the large-$N$ limit and collapse onto a universal curve once rescaled, tying semantic complexity to measurable redundancy. The work unifies token-level unpredictability with hierarchical semantics, interprets the branching factor $K$ as a cognitive (working-memory) constraint, and provides a calculable link between semantic structure and language compression, with potential implications for language understanding and processing efficiency.

Abstract

The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which is captured by the only free parameter in our model.
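
As a quick check of the redundancy figure quoted above, the arithmetic can be written out explicitly. The symbols $R$, $h$, and $h_{\max}$ are introduced here for illustration and are not taken from the paper; the 27-symbol alphabet (26 letters plus space) in the second line is likewise an assumption about what "random text" means.

```latex
% Redundancy relative to a uniform random source (illustrative notation).
\begin{align}
  R &= 1 - \frac{h}{h_{\max}} = 1 - \frac{1~\text{bit/char}}{5~\text{bits/char}} = 0.8, \\
  % Assumed alphabet: 26 letters plus space, so h_max = log2(27).
  h_{\max} &= \log_2 27 \approx 4.75~\text{bits/char}
  \quad\Rightarrow\quad R \approx 1 - \frac{1}{4.75} \approx 0.79 .
\end{align}
```

Either convention gives a redundancy of roughly 80 percent, consistent with the "nearly 80 percent" stated in the abstract.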

Paper Structure (26 sections, 111 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Schematic overview of two routes for estimating the entropy of a text. (a) The same text is provided to an LLM either to compute token log-probabilities (b) or to perform semantic chunking (e). (c) The LLM perplexity (equivalently, the per-token cross-entropy loss on the input) is converted into an LLM-based estimate of the text's entropy rate, $h_{\text{LLM}}$ (y-axis in (d)). (f) The chunking procedure applies recursive semantic segmentation down to the single-token level, thereby parsing the document into a hierarchical tree of spans whose leaves are tokens (a "semantic tree"). (g) Example semantic tree for the input text. (h) The probability of observing the resulting tree structure is computed under a random-tree ensemble model. (i) Example combinatorial tree from the ensemble corresponding to the semantic tree in (g). (j) Ensemble probabilities are converted into a theoretical entropy-rate estimate $h_{\text{theory}}$, which closely matches $h_{\mathrm{LLM}}$ across diverse corpora (panel (d)).
  • Figure 2: Comparison of chunk-size distributions between the random tree model and empirical semantic trees obtained via recursive semantic chunking. (a) Empirical versus theoretical chunk-size distribution at an intermediate tree level ($L=7$) for 20 narratives from RedditStories. (b) Normalized empirical chunk-size distributions (pooling chunk statistics from 100 narratives) compared with the theoretical prediction $f_L$ at multiple levels $L$.
  • Figure 3: Entropy rates across corpora. (a) Entropy rate as a function of $K$. Theoretical values correspond to $h_K$ in Eq. \ref{eq:entropy_extensive}. Empirical values are obtained by selecting the optimal $K$ for each corpus (Table \ref{table:kl_div}) and comparing $h_K$ to the corpus-level LLM estimate $h_{\mathrm{LLM}}$. Shaded bands indicate three approximate entropy-rate regimes that differentiate genres. (b) Entropy-rate estimates from individual random-tree realizations (simulated using Eq. \ref{eq:p_split}) concentrate around the predicted value as $N$ increases, consistent with the emergence of typical trees. (c) Per-token information for 100 RedditStories computed in two ways: from LLM perplexity and from the likelihood of empirical semantic trees obtained via chunking. As $N$ increases, both estimates fluctuate around the predicted entropy rate. (d)--(f) Cumulative LLM surprisal $-\sum_{i=1}^N \log P(t_i|t_{<i})$ for 100 texts from each corpus (color indicates text length $N$). The blue dash-dotted fit gives the LLM-based entropy-rate estimate $h_{\mathrm{LLM}}$. The red dashed line shows the theoretical prediction $h_K$ using the optimal $K$ for each corpus; its intercept is chosen for visual comparison (matched to the blue fit).
  • Figure 4: Comparison between the random tree model and empirical semantic trees obtained via recursive semantic chunking. (a) Theoretical scaling functions $f_L$ across levels $L$ shown on log--log axes. (b) When replotted using the $O(1)$ lognormal variable $x=(\ln s-\mu_L)/\sigma_L$, the curves in (a) collapse onto the universal $\mathcal{N}(0,1)$ distribution as $L$ increases. (c) Empirical chunk-size distributions $\hat{f}_L$ (same as in Fig. \ref{fig:tree_statistics}(b), rebinned to be uniform in $\log s$) across levels $L$. (d) The empirical curves in (c) likewise collapse under the transformation $x=(\ln s-\hat{\mu}_L)/\hat{\sigma}_L$, consistent with the theoretical prediction.
  • Figure S1: Entropy of trees. (a) Both theory and enumeration are for $K=4$. $H(N) \approx 2.5 \text{ nats} \times N$. (b)-(c) Cumulative log probabilities (TI) for different stories (shown in different shades of green); the blue solid curve is the linear regression line, and the red dashed line is our theory computed for $K=4$ from Eq. \ref{eq:entroy_dbl_sum}. (b) 26 Labov stories taken from \cite{georgiou2025large, sivan2025information, zhong2025random}. (c) 1000 RedditStories.
  • ...and 4 more figures
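
Two of the operations described in the captions above can be sketched in a few lines: reading an entropy rate off the slope of the cumulative surprisal (Figure 3(d)--(f)) and rescaling chunk sizes to the standardized lognormal variable $x=(\ln s-\hat{\mu}_L)/\hat{\sigma}_L$ (Figure 4(d)). This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, the fit is an ordinary least-squares line on a single text, $\hat{\mu}_L$ and $\hat{\sigma}_L$ are taken to be the sample mean and standard deviation of $\ln s$, and the inputs are synthetic rather than real LLM log-probabilities or chunk sizes.

```python
import math

import numpy as np


def entropy_rate_from_logprobs(token_logprobs):
    """Slope of cumulative surprisal versus token index, in nats per token.

    Hypothetical helper: `token_logprobs` holds natural-log probabilities
    log P(t_i | t_<i), one per token, as an LLM would report them.
    """
    surprisal = -np.asarray(token_logprobs, dtype=float)
    cumulative = np.cumsum(surprisal)
    positions = np.arange(1, len(cumulative) + 1)
    # Assumption: ordinary least-squares line; the slope is the rate estimate.
    slope, _intercept = np.polyfit(positions, cumulative, 1)
    return float(slope)


def standardize_log_sizes(chunk_sizes):
    """Map chunk sizes s to x = (ln s - mu_hat) / sigma_hat.

    Assumption: mu_hat and sigma_hat are the sample mean and (population)
    standard deviation of ln s at a single tree level.
    """
    log_s = np.log(np.asarray(chunk_sizes, dtype=float))
    return (log_s - log_s.mean()) / log_s.std()


# Synthetic illustration only; no real corpus or model is involved.
rng = np.random.default_rng(0)
fake_logprobs = -rng.exponential(scale=2.5, size=2000)  # mean surprisal near 2.5 nats
h_nats = entropy_rate_from_logprobs(fake_logprobs)
h_bits = h_nats / math.log(2)  # convert nats per token to bits per token
print(f"estimated rate: {h_nats:.3f} nats/token = {h_bits:.3f} bits/token")

fake_sizes = rng.lognormal(mean=3.0, sigma=0.8, size=5000)
x = standardize_log_sizes(fake_sizes)
print(f"rescaled mean {x.mean():.3f}, rescaled std {x.std():.3f}")
```

Fitting a slope rather than dividing the total surprisal by $N$ lets the intercept absorb any start-of-text offset. By construction the rescaled variable has zero mean and unit variance, so a collapse onto $\mathcal{N}(0,1)$ has to be judged from the shape of its histogram.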