Identifying the Periodicity of Information in Natural Language

Yulin Ou; Yu Wang; Yang Xu; Hendrik Buschmeier

TL;DR

This work introduces AutoPeriod of Surprisal (APS), a method that detects periods in the surprisal sequence of a text by combining periodogram-based hint extraction with an autocorrelation-based validation step. APS reveals that a notable portion of human language exhibits periodic information, including long periods that common structural units do not explain: the detected periods correlate with, yet extend beyond, typical EDU, sentence, and paragraph lengths. A harmonic regression (HR) sanity check confirms the presence of periodic components when APS-derived periods or hints are used as scalers in the model, and filtering the data by these periods or hints improves HR performance. Comparisons between human- and machine-generated texts indicate stronger long-range periodicity in LLM outputs, highlighting APS’s potential for generation-detection tasks and providing a new lens on information dynamics in natural language.
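The two-step idea described above (periodogram hints, then autocorrelation validation) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the permutation-based power threshold, the number of shuffles, and the width of the lag window are all assumptions made for the example.

```python
import numpy as np


def period_hints(x, n_perm=100, confidence=0.99, seed=0):
    """Step 1 (sketch): DFT bins whose power beats a permutation threshold."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)

    def power(series):
        # Periodogram without the zero-frequency bin.
        return np.abs(np.fft.rfft(series))[1:] ** 2 / n

    # Shuffling destroys temporal order, so the largest power of a shuffled
    # copy approximates what chance alone can produce (assumed threshold rule).
    max_powers = [power(rng.permutation(x)).max() for _ in range(n_perm)]
    threshold = np.quantile(max_powers, confidence)

    p = power(x)
    ks = np.arange(1, len(p) + 1)  # bin k corresponds to period n / k
    return [int(k) for k in ks[p > threshold] if k >= 2]


def autocorrelation(x):
    """Normalized autocorrelation for lags 0 .. n-1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:]
    return r / r[0]


def validate_hint(k, r, n):
    """Step 2 (sketch): keep a hint only if it sits on a hill of the ACF."""
    # Lags covered by DFT bin k, padded by one lag on each side (assumption).
    lo = max(int(np.floor(n / (k + 1))) - 1, 1)
    hi = min(int(np.ceil(n / (k - 1))) + 1, n // 2)
    if hi - lo < 2:
        return None
    peak = lo + int(np.argmax(r[lo:hi + 1]))
    # A maximum at the window edge means no hill inside the window.
    on_hill = lo < peak < hi and r[peak] > 0
    return peak if on_hill else None


def aps_sketch(surprisal, **kwargs):
    """Return validated periods (in tokens) for one surprisal sequence."""
    r = autocorrelation(surprisal)
    n = len(surprisal)
    periods = {validate_hint(k, r, n) for k in period_hints(surprisal, **kwargs)}
    return sorted(p for p in periods if p is not None)


# Synthetic stand-in for a surprisal sequence: period 20 plus noise.
rng = np.random.default_rng(1)
t = np.arange(400)
demo = np.sin(2 * np.pi * t / 20) + 0.3 * rng.standard_normal(400)
print(aps_sketch(demo))
```

On the synthetic sequence the sketch should report a period close to 20. Real surprisal values would come from a language model, which is outside the scope of this example.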

Abstract

Recent theoretical advances on information density in natural language raise the following question: to what degree does natural language exhibit periodic patterns in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and can identify any significant periods that exist in the surprisal sequence of a single document. Applying the algorithm to a set of corpora yields two main results. First, a considerable proportion of human language demonstrates a strong pattern of periodicity in information. Second, new periods that fall outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome of structural factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potential for LLM-generation detection are further discussed.
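To make the harmonic regression check concrete, the sketch below fits a single sinusoid of a given period by ordinary least squares and reports how much variance it explains. The paper's exact regression formulation is not given in this section, so the single-harmonic design matrix and the `harmonic_r2` helper are assumptions for illustration only.

```python
import numpy as np


def harmonic_r2(y, period):
    """R^2 of the fit y_t ~ b0 + a*cos(2*pi*t/period) + b*sin(2*pi*t/period)."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    angle = 2 * np.pi * t / period
    # Intercept plus one cosine/sine pair at the candidate period.
    design = np.column_stack([np.ones_like(angle), np.cos(angle), np.sin(angle)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    total = y - y.mean()
    return 1.0 - (residual @ residual) / (total @ total)


# Synthetic sequence with a true period of 20 tokens.
t = np.arange(400)
y = 2 + 1.5 * np.sin(2 * np.pi * t / 20 + 0.7)
print(harmonic_r2(y, 20), harmonic_r2(y, 7))
```

A candidate period that matches a real periodic component should explain far more variance than an arbitrary one, which is the intuition behind using such a fit as a sanity check.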

Paper Structure (20 sections, 5 equations, 5 figures, 8 tables, 2 algorithms)

Introduction
Background
UID and the Natural Fluctuation of Information
Why Studying Periodicity of Surprisal
Periodicity Detection
Prerequisites: Fourier transform and periodogram
Step 1: Extract hints from periodogram
Step 2: Period validation via ACF Filtering
Corpora and Language Models
Result: Identified Periods of Information
Proportions of periodic documents
Comparing APS periods with linguistic structural units
Result: Sanity-Check with Harmonic Regression Modeling
Periods/hints directly used as scaler in HR
Filtering data using periods/hints improves HR performance
...and 5 more sections

Figures (5)

Figure 1: Periodogram spectrum obtained from the surprisal sequence of document #0976 in PTB-WSJ corpus. Right: Red dots indicate the period hints returned by the period-hint extraction algorithm. Dashed lines are the power thresholds used at different confidence levels.
Figure 2: ACF filtering for surprisal sequence of document #0976 in PTB-WSJ corpus. Hints (red dots in top) are mapped to the raw periods in the ACF curve (red dots in middle). ACF filtering refines and returns the final periods (red stars in bottom).
Figure 3: Partitioning a corpus based on the periodicity detection outcomes from APS.
Figure 4: Distribution of periods identified by APS on RST Discourse Treebank, compared with the average lengths of structural units (EDUs, sentences, paragraphs). The average lengths of structural units are marked as dashed lines.
Figure 5: Distributions of periods from human (FACE BBC News) and LLM-generated (LLaMA-3-70B) texts.
