Identifying the Periodicity of Information in Natural Language
Yulin Ou, Yu Wang, Yang Xu, Hendrik Buschmeier
TL;DR
This work introduces AutoPeriod of Surprisal (APS), a method that detects periods in the surprisal sequence of a text by combining periodogram-based hint extraction with an autocorrelation-based validation step. APS reveals that a notable portion of human language exhibits periodic information, including long periods not explained by common structural units; the detected periods correlate with, yet extend beyond, typical EDU, sentence, and paragraph lengths. A harmonic regression sanity check built on APS-derived periods confirms the presence of periodic components, and the method demonstrates improved predictability in period-filtered data. Comparisons between human- and machine-generated texts indicate stronger long-range periodicity in LLM outputs, highlighting APS’s potential for generation-detection tasks and providing a new lens on information dynamics in natural language.
Abstract
Recent theoretical advances on information density in natural language raise the following question: To what degree does natural language exhibit periodic patterns in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and can identify any significant periods that exist in the surprisal sequence of a single document. Applying the algorithm to a set of corpora yields two main results. First, a considerable proportion of human language demonstrates a strong pattern of periodicity in information. Second, new periods that fall outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units, etc.) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome of both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potential for LLM-generation detection are further discussed.
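To make the two-stage idea concrete (periodogram-based hints followed by autocorrelation-based validation), here is a minimal sketch in Python. It is not the authors' implementation: the function names, the permutation-based power threshold, the search window derived from neighboring DFT bins, and the simplified "interior local maximum" test for an autocorrelation hill are all assumptions made for illustration. The input is assumed to be a per-token surprisal sequence from some language model; a synthetic sequence stands in for it here.

```python
import numpy as np

def periodogram_hints(x, n_perm=100, percentile=99, seed=0):
    """Return DFT bin indices whose power exceeds a permutation-based threshold.

    Assumption: the threshold is a high percentile of the maximum power
    observed over shuffled copies of the sequence (shuffling destroys periodicity).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    max_shuffled = [
        np.max(np.abs(np.fft.rfft(rng.permutation(x)))[1:] ** 2)
        for _ in range(n_perm)
    ]
    threshold = np.percentile(max_shuffled, percentile)
    # Start at bin 2 so that every candidate period is at most half the length.
    return [k for k in range(2, len(power)) if power[k] > threshold]

def autocorrelation(x):
    """Normalized autocorrelation via FFT (assumes a non-constant sequence)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, 2 * n)  # zero-pad to avoid circular wrap-around
    acf = np.fft.irfft(f * np.conj(f))[:n]
    return acf / acf[0]

def detect_periods(x):
    """Keep a hint only if the autocorrelation has a positive interior peak nearby."""
    n = len(x)
    hints = periodogram_hints(x)
    if not hints:
        return []
    acf = autocorrelation(x)
    periods = set()
    for k in hints:
        # Lags covered by the neighboring DFT bins, widened by one on each side.
        lo = max(1, int(np.floor(n / (k + 1))) - 1)
        hi = min(n // 2, int(np.ceil(n / (k - 1))) + 1)
        if hi - lo < 2:
            continue
        lag = lo + int(np.argmax(acf[lo:hi + 1]))
        # A maximum on the window boundary suggests a slope or valley, not a hill.
        if lo < lag < hi and acf[lag] > 0:
            periods.add(lag)
    return sorted(periods)

# Synthetic stand-in for a surprisal sequence: a period of 20 tokens plus noise.
t = np.arange(600)
noise = np.random.default_rng(1).normal(scale=0.2, size=t.size)
surprisal = 5.0 + np.sin(2 * np.pi * t / 20) + noise
print(detect_periods(surprisal))
```

The validation step matters because a periodogram alone has coarse resolution at long periods and can flag spurious frequencies; checking the autocorrelation around each hint both filters false positives and refines the period to an integer lag.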
