Linguistic Structure from a Bottleneck on Sequential Information Processing

Richard Futrell, Michael Hahn

TL;DR

The paper investigates how minimizing predictive information $E = I[X_{past} : X_{future}]$ shapes linguistic structure, showing that codes with low $E$ organize into approximately independent, local features that resemble words and phrases. Through targeted simulations and expansive cross-linguistic corpus analyses, it demonstrates that natural languages reduce predictive information relative to baselines across phonology, morphology, syntax, and semantics, yielding locality, hierarchy, and systematicity. This work links the statistical and algebraic organization of language to general cognitive constraints on sequential information processing, offering a principled information-theoretic explanation for why languages exhibit word-like units and structured hierarchies. It also connects to machine learning and neuroscience by aligning language structure with ICA-like disentanglement, efficient neural coding, and next-token predictability in language models, highlighting broad implications for understanding and modeling human communication.

Abstract

Human language has a distinct systematic structure, where utterances break into individually meaningful words which are combined to form phrases. We show that natural-language-like systematicity arises in codes that are constrained by a statistical measure of complexity called predictive information, also known as excess entropy. Predictive information is the mutual information between the past and future of a stochastic process. In simulations, we find that such codes break messages into groups of approximately independent features which are expressed systematically and locally, corresponding to words and phrases. Next, drawing on crosslinguistic text corpora, we find that actual human languages are structured in a way that reduces predictive information compared to baselines at the levels of phonology, morphology, syntax, and lexical semantics. Our results establish a link between the statistical and algebraic structure of language and reinforce the idea that these structures are shaped by communication under general cognitive constraints.
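The definition of predictive information in the abstract can be written out as a worked equation. The following is a sketch in standard excess-entropy notation, assuming that $h_n$ denotes the conditional entropy of the $n$-th symbol given the preceding $n-1$ symbols and that $h$ is the asymptotic entropy rate of the process (the index conventions here are an assumption, chosen to agree with the caption of Figure 3):

$$
E \;=\; I[X_{past} : X_{future}] \;=\; \sum_{n=1}^{\infty} \left( h_n - h \right),
\qquad
h_n = H[X_n \mid X_1, \dots, X_{n-1}],
\qquad
h = \lim_{n \to \infty} h_n .
$$

Each term $h_n - h$ is the extra uncertainty about the next symbol that remains when only $n-1$ symbols of context are available, so $E$ is small when short contexts already predict almost as well as unbounded ones.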

Paper Structure (25 sections, 7 equations, 8 figures)

This paper contains 25 sections, 7 equations, 8 figures.

Figures (8)

  • Figure 1: Example utterances describing an image in English and various hypothetical languages. A. An English utterance exhibiting systematicity and locality. B. An unnatural systematic language in which "gol" means a cat head paired with a dog head and "nar" means a cat body paired with a dog body. C. A nonlocal but systematic language in which an utterance is formed by interleaving the words for 'cat' and 'dog'. D. A holistic language in which the form "vek" means 'a cat with a dog' with no correspondence between parts of form and parts of meaning.
  • Figure 2: Two examples of linguistic systematicity as a homomorphism. $L(\cdot)$ stands for the English language, seen as a function from meanings to forms (strings). A. The meaning naturally decomposes into two features corresponding to the two animals. The form "a cat with a dog" decomposes systematically into forms for the cat and the dog, concatenated together with the string "with" between them. B. The meaning naturally decomposes into two features, corresponding to color and shape. The form "blue square" decomposes systematically into forms for the color and the shape, concatenated together.
  • Figure 3: Schematic calculation of predictive information as the sum of $n$-gram entropies $h_n$ minus the asymptotic entropy rate $h$.
  • Figure 4: Simulations of languages for coinflip distributions. A. Two unambiguous languages for meanings consisting of three weighted coinflips. In the systematic language, each letter corresponds to the outcome from one coinflip. In the holistic language, there is no natural systematic relationship between the form and the meaning. B. Calculation of predictive information for the source and two languages in panel A. The systematic language has lower predictive information. C. Predictive information of all bijective mappings from meanings to length-3 binary strings, for the meanings and source in panel A. Languages are ordered by predictive information and colored by the number of coinflips expressed systematically: 3 for a fully systematic language and 0 for a fully holistic language. The inset box zooms in on the low predictive information region. D. Languages used in panel E along with an example source, which has mutual information $\operatorname{I}[M_2 : M_3] \approx 0.18$ bits. E. Predictive information of various languages for varying levels of mutual information between coinflips $M_2$ and $M_3$ (see text). Zero mutual information corresponds to panels B and C. The 'natural' language expresses $M_2$ and $M_3$ together holistically. The 'unnatural' language expresses $M_1$ and $M_2$ together holistically.
  • Figure 5: Simulations of codes with different orders of elements. A. Predictive information of all string permutations of a systematic language for a Zipfian source. Permutations that combine components by concatenation, marked in red, achieve the lowest predictive information. The inset zooms in on the 2000 permutations with the lowest predictive information. B. A hierarchically structured source distribution (see text) and predictive information of all permutations of a systematic language for this source. A language is well-nested when all groups of letters corresponding to groupings in the inset tree figure are contiguous.
  • ...and 3 more figures
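
The comparison described in Figure 4A and 4B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that utterances are sampled i.i.d., joined into a stream with a delimiter `#`, and observed through a window whose starting phase is uniform. The coin weights `P_ONE`, the `scrambled` holistic mapping, and all function names are hypothetical choices made for this example. Predictive information is computed as $H_N - N h$, where $H_N$ is the entropy of a length-$N$ window and $h$ is the entropy rate per symbol.

```python
import itertools
import math
from collections import defaultdict

DELIM = "#"  # assumed utterance delimiter (an assumption of this sketch)


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def window_distribution(code_probs, width):
    """Distribution of a length-`width` window over a stream of i.i.d.
    utterances joined by DELIM, with the window's start phase uniform."""
    period = len(next(iter(code_probs))) + 1  # utterance length + delimiter
    n_utts = width // period + 2              # enough utterances for any phase
    dist = defaultdict(float)
    for utts in itertools.product(code_probs, repeat=n_utts):
        prob = math.prod(code_probs[u] for u in utts)
        stream = "".join(u + DELIM for u in utts)
        for phase in range(period):
            dist[stream[phase:phase + width]] += prob / period
    return dist


def predictive_information(code_probs, width=8):
    """Return H_N - N*h for N = `width`. For this delimited stream, three
    symbols of context already fix the phase, so h_n equals h for n >= 4
    and the difference no longer changes with N."""
    period = len(next(iter(code_probs))) + 1
    h = entropy(code_probs) / period  # entropy rate per symbol (bijective code)
    return entropy(window_distribution(code_probs, width)) - width * h


# Three independent weighted coinflips; weights are illustrative only.
P_ONE = (0.2, 0.3, 0.4)
meanings = list(itertools.product((0, 1), repeat=3))
source = {
    m: math.prod(p if bit else 1 - p for bit, p in zip(m, P_ONE))
    for m in meanings
}

# Systematic language: one letter per coinflip outcome.
systematic = {"".join(map(str, m)): pr for m, pr in source.items()}

# Holistic language: an arbitrary scrambled bijection (hypothetical example).
scrambled = ["000", "011", "101", "110", "111", "100", "010", "001"]
holistic = {s: source[m] for m, s in zip(meanings, scrambled)}

E_sys = predictive_information(systematic)
E_hol = predictive_information(holistic)
print(f"systematic: {E_sys:.3f} bits, holistic: {E_hol:.3f} bits")
```

Under these assumptions the systematic language comes out at exactly $\log_2 4 = 2$ bits, all of it due to uncertainty about where the window falls relative to the delimiter, while the scrambled mapping adds further predictive information because its letters are statistically dependent within an utterance. The absolute values depend on the delimiter convention and need not match those plotted in the paper's figure; only the ordering of the two languages is the point of the sketch.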