Linguistic Structure from a Bottleneck on Sequential Information Processing
Richard Futrell, Michael Hahn
TL;DR
The paper investigates how minimizing predictive information, $E = I[X_{\text{past}} : X_{\text{future}}]$, shapes linguistic structure. In simulations, codes with low $E$ organize messages into approximately independent features that are expressed systematically and locally, resembling words and phrases. Cross-linguistic corpus analyses then show that natural languages have lower predictive information than baselines at the levels of phonology, morphology, syntax, and lexical semantics, which the authors connect to locality, hierarchy, and systematicity. The work links the statistical and algebraic structure of language to general cognitive constraints on sequential information processing, offering an information-theoretic account of why languages have word-like units and hierarchical structure. It also relates language structure to ICA-like disentanglement, efficient neural coding, and next-token predictability in language models.
Abstract
Human language has a distinct systematic structure, where utterances break into individually meaningful words which are combined to form phrases. We show that natural-language-like systematicity arises in codes that are constrained by a statistical measure of complexity called predictive information, also known as excess entropy. Predictive information is the mutual information between the past and future of a stochastic process. In simulations, we find that such codes break messages into groups of approximately independent features which are expressed systematically and locally, corresponding to words and phrases. Next, drawing on crosslinguistic text corpora, we find that actual human languages are structured in a way that reduces predictive information compared to baselines at the levels of phonology, morphology, syntax, and lexical semantics. Our results establish a link between the statistical and algebraic structure of language and reinforce the idea that these structures are shaped by communication under general cognitive constraints.
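To make the definition of predictive information concrete, here is a minimal Python sketch that computes the mutual information between the "past" and "future" halves of a small set of equally likely strings. It is a toy illustration only, not the authors' simulation or estimation procedure: the function names, the two example codes, and the single fixed cut point are all assumptions introduced here, whereas the measure in the abstract is defined over a stochastic process.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in mutual information in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # I[X : Y] = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def past_future_information(strings, cut):
    """Mutual information between s[:cut] and s[cut:] for equally likely strings.

    Hypothetical helper: a single-cut, finite-string stand-in for the
    past/future mutual information of a stochastic process.
    """
    return mutual_information([(s[:cut], s[cut:]) for s in strings])


# Toy codes for four equally likely messages made of two independent
# binary features (both codes are invented for illustration).
# Systematic code: each feature is expressed by its own symbol.
systematic = ["ac", "ad", "bc", "bd"]
# Holistic code: the first symbol fully determines the second.
holistic = ["ac", "bd", "ca", "db"]

print(past_future_information(systematic, cut=1))  # 0.0 bits
print(past_future_information(holistic, cut=1))    # 2.0 bits
```

In this toy setting the systematic code has zero information shared across the cut, because the two symbols express independent features, while the holistic code carries 2 bits across the cut, because each first symbol predicts the second exactly.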
