Criticality in Formal Languages and Statistical Physics

Henry W. Lin, Max Tegmark

TL;DR

The paper reveals that mutual information decay in formal languages depends on the generative grammar: probabilistic regular grammars yield exponential decay, while context-free grammars with hierarchical depth can produce power-law decay, signaling critical-like long-range correlations. It introduces rational mutual information as a practical bound and derives that probabilistic regular grammars cannot exhibit criticality, whereas PCFGs can (Theorem 3). By linking Bayesian networks, CNF grammar forms, and deep hierarchical models, the work connects these ideas to physics (no 1D phase transitions) and to neural networks, suggesting that depth enables short-path correlations that reproduce long-range dependencies. It also proposes a practical diagnostic—analyzing mutual information as a function of symbol separation—to evaluate and improve machine learning models, particularly recurrent architectures like LSTMs.

Abstract

We show that the mutual information between two symbols, as a function of the number of symbols between the two, decays exponentially in any probabilistic regular grammar, but can decay like a power law for a context-free grammar. This result about formal languages is closely related to a well-known result in classical statistical mechanics that there are no phase transitions in dimensions fewer than two. It is also related to the emergence of power-law correlations in turbulence and cosmological inflation through recursive generative processes. We elucidate these physics connections and comment on potential applications of our results to machine learning tasks like training artificial recurrent neural networks. Along the way, we introduce a useful quantity which we dub the rational mutual information and discuss generalizations of our claims involving more complicated Bayesian networks.
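The exponential decay claimed for Markov-type processes can be checked numerically on a toy example. The sketch below is not the authors' code: it assumes a hypothetical two-state chain that keeps its symbol with probability 0.9, and computes the ordinary (Shannon) mutual information exactly from the joint distribution of two symbols `d` steps apart. The function name `markov_mutual_information` and the fitting range are illustrative choices.

```python
import numpy as np

def markov_mutual_information(M, d):
    """Exact mutual information (bits) between symbols d steps apart
    in a stationary Markov chain with M[i, j] = P(next = j | current = i)."""
    M = np.asarray(M, dtype=float)
    # Stationary distribution: left eigenvector of M with eigenvalue 1.
    vals, vecs = np.linalg.eig(M.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    mu = mu / mu.sum()
    # Joint distribution P(X_t = a, X_{t+d} = b) = mu[a] * (M^d)[a, b].
    joint = mu[:, None] * np.linalg.matrix_power(M, d)
    indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0  # skip zero-probability cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

# Assumed example chain: stay with probability 0.9, flip with probability 0.1.
M = np.array([[0.9, 0.1],
              [0.1, 0.9]])
separations = np.arange(1, 21)
mi = np.array([markov_mutual_information(M, int(d)) for d in separations])

# A straight line in log(MI) versus d indicates exponential decay.
# For this chain the second eigenvalue is 0.8, and the fitted rate
# should be close to 2 * ln(0.8).
slope = np.polyfit(separations[5:], np.log(mi[5:]), 1)[0]
print(f"fitted rate: {slope:.3f}   2*ln(0.8): {2 * np.log(0.8):.3f}")
```

On a semi-log plot the resulting curve is a straight line, which is the signature that distinguishes a Markov process from the power-law behavior discussed in the abstract.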

Paper Structure

This paper contains 20 sections, 61 equations, 5 figures.

Figures (5)

  • Figure 1: Decay of mutual information with separation. Here the mutual information in bits per symbol is shown as a function of separation $d(X,Y) = |i-j|$, where the symbols $X$ and $Y$ are located at positions $i$ and $j$ in the sequence in question, and shaded bands correspond to $1-\sigma$ error bars. The statistics were computed over a sliding window using an estimator for the mutual information detailed in Appendix D. All measured curves are seen to decay roughly as power laws, explaining why they cannot be accurately modeled as Markov processes, for which the mutual information instead plummets exponentially (the example shown has $I\propto e^{-d/6}$). The measured curves are seen to be qualitatively similar to that of a famous critical system in physics: a 1D slice through a critical 2D Ising model, where the slope is $-1/2$. The human genome data consists of 177,696,512 base pairs {A, C, T, G} from chromosome 5 from the National Center for Biotechnology Information genome, with unknown base pairs omitted. The Bach data consists of 5727 notes from Partita No. 2 bach, with all notes mapped into a 12-symbol alphabet consisting of the 12 half-tones {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, with all timing, volume and octave information discarded. The three text corpora are 100 MB from Wikipedia hutter (206 symbols), the first 114 MB of a French corpus french (185 symbols) and 27 MB of English articles from slate.com (143 symbols). The large long-range information appears to be dominated by poems in the French sample and by HTML-like syntax in the Wikipedia sample.
  • Figure 2: Both a traditional Markov process (top) and our recursive generative grammar process (bottom) can be represented as Bayesian networks, where the random variable at each node depends only on the node pointing to it with an arrow. The numbers show the geodesic distance $\Delta$ to the leftmost node, defined as the smallest number of edges that must be traversed to get there. Roughly speaking, our results show that for large $\Delta$, the mutual information decays exponentially with $\Delta$ (see Theorems 1 and 2). Since this geodesic distance $\Delta$ grows only logarithmically with the separation in time in a hierarchical generative grammar (the hierarchy creates very efficient shortcuts), the exponential kills the logarithm and we are left with power-law decays of mutual information in such languages.
  • Figure 3: Our deep generative grammar model can be viewed as an idealization of a long short-term memory (LSTM) recurrent neural net, where the "forget weights" drop with depth so that the forget timescales grow exponentially with depth. The graph drawn here is clearly isomorphic to the graph drawn in Figure 2. For each cell, we approximate the usual incremental updating rule by either perfectly remembering the previous state (horizontal arrows) or by ignoring the previous state and determining the cell state by a random rule depending on the node above (vertical arrows).
  • Figure 4: Diagnosing different models by hallucinating text and then measuring the mutual information as a function of separation. The red line is the mutual information of enwik8, a 100 MB sample of English Wikipedia. In shaded blue is the mutual information of hallucinated Wikipedia from a trained LSTM with 3 layers of size 256. We plot in solid black the mutual information of a Markov process on single characters, which we compute exactly. (This would correspond to the mutual information of hallucinations in the limit where the length of the hallucinations goes to infinity.) This curve shows a sharp exponential decay after a distance of $\sim 10$, in agreement with our theoretical predictions. We also measured the mutual information for hallucinated text on a Markov process for bigrams, which still underperforms the LSTMs in long-ranged correlations, despite having $\sim 10^3$ more parameters than
  • Figure 5: Decay of rational mutual information with separation for a binary sequence from a numerical simulation with probabilities $p(0|0) = p(1|1) = 0.9$ and a branching factor $q=2$. The blue curve is not a fit to the simulated data but rather an analytic calculation. The smooth power law displayed on the left is what is predicted by our "continuum" approximation. The very small discrepancies (right) are not random but are fully accounted for by more involved exact calculations with discrete sums.
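The argument in the Figure 2 caption (exponential decay in $\Delta$, with $\Delta$ growing only logarithmically in the separation) can be illustrated with a small exact calculation. The sketch below is not the authors' simulation and does not reproduce Figure 5: it borrows the parameters $p(0|0) = p(1|1) = 0.9$ and $q = 2$ from that caption, but computes the ordinary (Shannon) mutual information rather than the rational mutual information, and only for one assumed family of leaf pairs (leaf 0 and leaf $q^{k-1}$, whose lowest common ancestor sits $k$ levels up). The function name `tree_mutual_information` and the assumption that the ancestor follows the stationary distribution are illustrative choices.

```python
import numpy as np

def tree_mutual_information(M, k):
    """Mutual information (bits) between two leaves whose lowest common
    ancestor is k levels up in the tree.

    Assumes every child is drawn from its parent via M[parent, child] and
    that the ancestor follows the stationary distribution of M.
    """
    M = np.asarray(M, dtype=float)
    vals, vecs = np.linalg.eig(M.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    mu = mu / mu.sum()
    Mk = np.linalg.matrix_power(M, k)
    # P(a, b) = sum_r mu[r] * (M^k)[r, a] * (M^k)[r, b]
    joint = np.einsum("r,ra,rb->ab", mu, Mk, Mk)
    indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return float(np.sum(joint * np.log2(joint / indep)))

p, q = 0.9, 2  # parameters taken from the Figure 5 caption
M = np.array([[p, 1 - p],
              [1 - p, p]])

levels = np.arange(1, 11)
# Leaf 0 and leaf q**(k-1) first share an ancestor k levels up,
# so the geodesic distance between them is 2k.
separations = q ** (levels - 1)
mi = np.array([tree_mutual_information(M, int(k)) for k in levels])

# A straight line in log(MI) versus log(separation) indicates a power law.
# For this symmetric binary model the exponent should approach 4 * log2(0.8).
slope = np.polyfit(np.log(separations[3:]), np.log(mi[3:]), 1)[0]
print(f"fitted exponent: {slope:.3f}   4*log2(0.8): {4 * np.log2(0.8):.3f}")
```

The same transition matrix produces exponential decay when symbols are chained one after another. Here the separation grows exponentially with tree height while the geodesic distance grows only linearly, which turns the exponential into a power law.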