Identifying the Periodicity of Information in Natural Language

Yulin Ou, Yu Wang, Yang Xu, Hendrik Buschmeier

TL;DR

This work introduces AutoPeriod of Surprisal (APS), a method that detects periods in the surprisal sequence of a text by combining periodogram-based hint extraction with an autocorrelation-based validation step. APS reveals that a notable portion of human language exhibits periodic information, including long periods not explained by common structural units; the detected periods correlate with, yet extend beyond, typical EDU, sentence, and paragraph lengths. A harmonic regression sanity check confirms the presence of periodic components when models incorporate APS-derived cues, and predictability improves on period-filtered data. Comparisons between human- and machine-generated texts indicate stronger long-range periodicity in LLM outputs, highlighting APS's potential for generation-detection tasks and offering a new lens on information dynamics in natural language.

Abstract

Recent theoretical advances on information density in natural language have raised the following question: To what degree does natural language exhibit periodic patterns in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and can identify any significant periods that exist in the surprisal sequence of a single document. Applying the algorithm to a set of corpora yields two results. First, a considerable proportion of human language demonstrates a strong pattern of periodicity in information. Second, new periods that lie outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome of both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potential in LLM-generation detection are further discussed.
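
To make the harmonic regression check mentioned in the abstract more concrete, here is a minimal sketch in Python. It assumes the generic textbook form of harmonic regression (an intercept plus one cosine and one sine term per candidate period, fitted by ordinary least squares); the paper's exact model specification is not given here and may differ. The function names `harmonic_design` and `harmonic_r2` are hypothetical, and the input is a synthetic sequence standing in for a real surprisal sequence.

```python
import numpy as np

def harmonic_design(n, periods):
    """Design matrix: intercept plus a cos/sin pair for each candidate period."""
    t = np.arange(n)
    cols = [np.ones(n)]
    for p in periods:
        cols.append(np.cos(2 * np.pi * t / p))
        cols.append(np.sin(2 * np.pi * t / p))
    return np.column_stack(cols)

def harmonic_r2(y, periods):
    """Fit the harmonic regression by least squares and return R^2."""
    y = np.asarray(y, dtype=float)
    X = harmonic_design(len(y), periods)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - (resid ** 2).sum() / ss_tot

# Synthetic stand-in for a surprisal sequence (assumption: period of 20 tokens).
rng = np.random.default_rng(0)
t = np.arange(400)
y = 5.0 + 2.0 * np.sin(2 * np.pi * t / 20) + rng.normal(0.0, 0.3, size=400)

print("R^2 with the true period:", harmonic_r2(y, [20]))
print("R^2 with an unrelated period:", harmonic_r2(y, [7]))
```

If a period reported by a detector is real, adding its cos/sin pair should explain noticeably more variance than an arbitrary period does, which is the intuition behind using the regression as a sanity check.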

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 8 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Periodogram spectrum obtained from the surprisal sequence of document #0976 in the PTB-WSJ corpus. Right: Red dots indicate the period hints returned by the period-hints algorithm. Dashed lines are the power thresholds used at different confidence levels.
  • Figure 2: ACF filtering for the surprisal sequence of document #0976 in the PTB-WSJ corpus. Hints (red dots, top) are mapped to the raw periods on the ACF curve (red dots, middle). ACF filtering refines these and returns the final periods (red stars, bottom).
  • Figure 3: Partitioning a corpus based on the periodicity detection outcomes from APS.
  • Figure 4: Distribution of periods identified by APS on RST Discourse Treebank, compared with the average lengths of structural units (EDUs, sentences, paragraphs). The average lengths of structural units are marked as dashed lines.
  • Figure 5: Distributions of periods from human (FACE BBC News) and LLM-generated (LLaMA-3-70B) texts.
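
The two-stage procedure described in the captions of Figures 1 and 2 (periodogram hints, then ACF filtering) can be sketched as follows. This is a simplified illustration of the general AutoPeriod idea, not the authors' implementation: the permutation-based power threshold, the `half_width` search window, and the "strict local maximum with positive autocorrelation" test are assumptions chosen for brevity, and all function names are hypothetical. The input is a synthetic sequence standing in for per-token surprisal values.

```python
import numpy as np

def period_hints(x, n_perm=100, confidence=0.99, seed=0):
    """Stage 1: candidate periods whose periodogram power beats a permutation threshold."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)

    def power(sig):
        return np.abs(np.fft.rfft(sig)) ** 2 / n

    # Shuffling destroys temporal structure, so the maximum power of shuffled
    # copies estimates what chance alone can produce.
    max_powers = [power(rng.permutation(x))[1:].max() for _ in range(n_perm)]
    threshold = np.quantile(max_powers, confidence)

    p = power(x)
    ks = np.arange(1, len(p))[p[1:] > threshold]
    # Frequency index k corresponds to period n / k; keep periods in [2, n / 2].
    return sorted({n / k for k in ks if 2 <= n / k <= n / 2})

def acf(x):
    """Autocorrelation function via FFT, zero-padded to avoid circular wrap-around."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, 2 * n)
    r = np.fft.irfft(f * np.conj(f))[:n]
    return r / r[0]

def validate_hints(hints, r, half_width=2):
    """Stage 2: keep a hint only if the ACF has a positive local peak near it."""
    periods = []
    for h in hints:
        lo = max(1, int(round(h)) - half_width)
        hi = min(len(r) - 2, int(round(h)) + half_width)
        if lo > hi:
            continue
        peak = lo + int(np.argmax(r[lo:hi + 1]))
        if r[peak] > 0 and r[peak] > r[peak - 1] and r[peak] > r[peak + 1]:
            periods.append(peak)
    return sorted(set(periods))

def detect_periods(x):
    """Run both stages and return the validated integer periods."""
    return validate_hints(period_hints(x), acf(x))

# Synthetic stand-in for a surprisal sequence (assumption: period of 20 tokens).
rng = np.random.default_rng(1)
t = np.arange(400)
x = np.sin(2 * np.pi * t / 20) + rng.normal(0.0, 0.1, size=400)
print(detect_periods(x))
```

The periodogram is good at proposing candidates but has coarse resolution for long periods, while the ACF is finer-grained but peaks at every multiple of a true period; using the first to propose and the second to confirm is the design choice the captions describe.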