The Statistical Signature of LLMs

Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi

TL;DR

Lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text, yielding a robust framework for quantifying how generative systems reshape textual production.

Abstract

Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
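
As a rough illustration of the measure the abstract describes, the sketch below computes a compression ratio as compressed size divided by raw size, so that higher values mean lower compressibility. The choice of `zlib`, the compression level, and the toy inputs are assumptions made for this example; the abstract does not state which compressor or settings the authors used.

```python
import random
import string
import zlib


def compression_ratio(text: str, level: int = 9) -> float:
    """Return compressed size / raw size in bytes (lower = more compressible).

    Assumption: zlib (DEFLATE) stands in for whichever lossless compressor
    the paper actually uses.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, level)) / len(raw)


# Toy inputs, not data from the paper: a highly regular string and a
# string of characters drawn uniformly at random.
repetitive = "the cat sat on the mat. " * 200
rng = random.Random(0)
alphabet = string.ascii_lowercase + " "
noisy = "".join(rng.choices(alphabet, k=5000))

print(f"repetitive text: {compression_ratio(repetitive):.3f}")
print(f"random text:     {compression_ratio(noisy):.3f}")
```

The regular string yields a much smaller ratio than the random one, which is the sense in which more regular text is "more compressible". Ratios for very short texts are dominated by compressor overhead, which is one plausible reason to compare texts of similar length.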

Paper Structure (12 sections, 6 equations, 5 figures)

Figures (5)

  • Figure 1: (A) Relationship between vocabulary entropy and compression ratio for texts generated from word distributions with fixed entropy. The colored points show the average values for Humans and LLMs. The bars indicate one standard deviation from the mean. The inset displays the density distribution of document length (number of words) for human-written and LLM-generated texts. (B) Distribution of compression ratios for human-written texts, LLM-generated texts, and randomly generated texts. Higher compression ratios correspond to lower compressibility. Each group contains $1000$ documents. (C) Average compression ratio of human-written and LLM-generated texts as a function of the number of sentences in the text. The shaded area shows the interquartile range of the compression ratio distribution. For LLM-generated text, compressibility increases with length; this is not the case for human-written text.
  • Figure 2: Distribution of structural and compression-based features across human-written and LLM-generated texts in the Human–AI Parallel Corpus.
  • Figure 3: Global feature importance based on mean absolute Shapley values for the Gradient Boosting Classifier. Bars indicate the average magnitude of each feature's contribution to the predicted probability of the Human class across the test set.
  • Figure 4: (A) Average compression ratio of Wikipedia and Grokipedia page texts as a function of the number of sentences in the text. The shaded area represents the interquartile range of the compression ratio distribution. (B) Distribution of Conditional Compression Ratio, Normalized Word-Level Entropy, Mean Repetition Distance, and Repetition Distance Variability across texts from Wikipedia and Grokipedia pages.
  • Figure 5: (A) Average compression ratio of Moltbook and Reddit comments as a function of the number of sentences in the text. The shaded area represents the interquartile range of the compression ratio distribution. (B) Distribution of Normalized Compression Distance, Prefix Ratio Trend, Unique Word Ratio, and Conditional Compression Ratio across Moltbook and Reddit posts.
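
Two of the features named in the captions above can be sketched in a few lines. The Normalized Compression Distance follows its standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The normalized word-level entropy shown here divides the Shannon entropy of the word frequencies by the log of the vocabulary size, which is an assumption: the paper's exact definitions, tokenization, and compressor are not given in this excerpt, and `zlib` is used only as a stand-in.

```python
import math
import zlib
from collections import Counter


def _csize(data: bytes) -> int:
    """Compressed size in bytes (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two texts (standard formula)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy, cxy = _csize(bx), _csize(by), _csize(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def normalized_word_entropy(text: str) -> float:
    """Shannon entropy of word frequencies, scaled to [0, 1].

    Assumption: whitespace tokenization and normalization by
    log2(vocabulary size); the paper may define this differently.
    """
    words = text.lower().split()
    counts = Counter(words)
    if len(counts) < 2:
        return 0.0  # a single word type carries no uncertainty
    n = len(words)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))


# Hypothetical example sentences, not taken from any of the corpora.
x = ("Lossless compressors exploit repeated substrings to shorten "
     "their input, so predictable text shrinks more.")
y = ("Seventeen purple kangaroos quietly juggled marble biscuits "
     "behind a foggy windmill yesterday.")

print(f"NCD(x, x) = {ncd(x, x):.3f}")
print(f"NCD(x, y) = {ncd(x, y):.3f}")
print(f"normalized entropy of x = {normalized_word_entropy(x):.3f}")
```

A text compared with itself gives an NCD close to zero, while unrelated texts give a value closer to one. With a real compressor the distance is only approximate and can slightly exceed one for short inputs.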