Toward a statistical mechanics of four letter words

Greg J. Stephens, William Bialek

TL;DR

This work treats words as a network of interacting letters and approximates the probability distribution of the states this network takes on. It suggests that these states provide an effective vocabulary that is matched to the frequency of word use and is much smaller than the full lexicon.

Abstract

We consider words as a network of interacting letters, and approximate the probability distribution of states taken on by this network. Despite the intuition that the rules of English spelling are highly combinatorial (and arbitrary), we find that maximum entropy models consistent with pairwise correlations among letters provide a surprisingly good approximation to the full statistics of four letter words, capturing ~92% of the multi-information among letters and even "discovering" real words that were not represented in the data from which the pairwise correlations were estimated. The maximum entropy model defines an energy landscape on the space of possible words, and local minima in this landscape account for nearly two-thirds of words used in written English.
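
For orientation, the quantities named in the abstract can be written in the standard form used for pairwise maximum entropy models. This is a sketch in assumed notation, not a reproduction of the paper's own equations: $V_{ij}$, $Z$, $E$, $S[\cdot]$ and $I_2$ are symbols introduced here, while $P_2$ and $P_{sampled}$ match the labels used in the figure captions. The single-letter-change definition of a local minimum is likewise an assumption about how "local" is meant.

```latex
% Pairwise maximum entropy model over the four letter positions (assumed notation)
P_2(\ell_1,\ell_2,\ell_3,\ell_4)
  = \frac{1}{Z}\exp\!\Big[-\sum_{i<j} V_{ij}(\ell_i,\ell_j)\Big],
\qquad
E(\ell_1,\ldots,\ell_4) = \sum_{i<j} V_{ij}(\ell_i,\ell_j)

% Multi-information of the sampled distribution and the part captured by P_2
I   = \sum_{i=1}^{4} S[P_i] - S[P_{sampled}],
\qquad
I_2 = \sum_{i=1}^{4} S[P_i] - S[P_2],
\qquad
\frac{I_2}{I} \approx 0.92

% A word w is a local minimum of the landscape if E(w) < E(w')
% for every w' that differs from w in exactly one letter (assumed definition).
```

Here $P_i$ is the single-letter distribution at position $i$ and $S[\cdot]$ is the Shannon entropy. Because $P_2$ reproduces every pairwise marginal, it also reproduces every $P_i$, so the same independent-letter entropy appears in both $I$ and $I_2$.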

Paper Structure

This paper contains 3 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: (a) The six pairwise marginal distributions of four-letter words sampled from the Jane Austen corpus. Common letter pairs such as "th" in $\rho_{12}$ are apparent in their large marginal probability. (b) The iterative scaling algorithm solves the constrained maximization problem to high precision. All pairwise marginal components of the full distribution are compared to the marginals constructed from the computed maximum entropy distribution.
  • Figure 2: (left) The pairwise maximum entropy model provides an excellent approximation to the full distribution of four-letter words, capturing $92\%$ of the multi-information. (right-dots) Scatter plot of the four letter word probabilities in the full distribution $P_{sampled}$ vs. the corresponding probabilities in the maximum entropy distribution $P_2$. (right-red crosses) To facilitate the comparison we divided the full probability into 20 equally log-spaced bins and computed the mean maximum entropy probability conditioned on the states in the full distribution within each bin. The dashed line marks the identity. (right-blue circles) Even for small probabilities, there are still words such as 'edge' and 'itch' whose states are well-captured by the pairwise model.
  • Figure 3: The Zipf plot for all words in the corpus (black line), four letter words in the corpus (blue crosses), and four letter words in the maximum entropy model (red crosses). Green circles denote 'non-words', states in the maximum entropy model that didn't appear in the corpus. The 25 most likely 'non-words' are shown in the text inset (ordered in decreasing probability from left to right and top to bottom). Some of these are recognizable as real words that just did not appear in the corpus, and even the others have plausible spelling.
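
The captions for Figures 1 and 2 mention an iterative scaling algorithm and a captured fraction of the multi-information. The sketch below illustrates both on a toy problem. It is not the authors' code: it uses iterative proportional fitting as one common form of iterative scaling, a random distribution over a 5-symbol alphabet stands in for the Jane Austen word counts, and the names `pair_marginal`, `fit_pairwise_maxent`, `entropy_bits` and `multi_information_fraction` are hypothetical.

```python
import itertools
import numpy as np


def pair_marginal(p, i, j):
    """Marginal over positions i < j of a joint array p (one axis per letter position)."""
    other_axes = tuple(k for k in range(p.ndim) if k not in (i, j))
    return p.sum(axis=other_axes)


def fit_pairwise_maxent(p_emp, n_sweeps=500):
    """Iterative proportional fitting: match every pairwise marginal of p_emp."""
    n = p_emp.ndim
    pairs = list(itertools.combinations(range(n), 2))
    targets = {pair: pair_marginal(p_emp, *pair) for pair in pairs}
    p = np.full(p_emp.shape, 1.0 / p_emp.size)  # start from the uniform distribution
    for _ in range(n_sweeps):
        for i, j in pairs:
            model = pair_marginal(p, i, j)
            # Rescale so the (i, j) marginal equals its target; zero cells stay zero.
            ratio = np.divide(targets[(i, j)], model,
                              out=np.zeros_like(model), where=model > 0)
            shape = [1] * n
            shape[i], shape[j] = p.shape[i], p.shape[j]
            p = p * ratio.reshape(shape)
    return p


def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability states."""
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())


def multi_information_fraction(p_emp, p2):
    """Share of the multi-information of p_emp that the model p2 captures."""
    n = p_emp.ndim
    s_independent = sum(
        entropy_bits(p_emp.sum(axis=tuple(k for k in range(n) if k != i)))
        for i in range(n)
    )
    return (s_independent - entropy_bits(p2)) / (s_independent - entropy_bits(p_emp))


# Toy stand-in for a four-letter-word distribution (assumption: random, not real text).
rng = np.random.default_rng(0)
p_emp = rng.random((5, 5, 5, 5)) ** 2
p_emp /= p_emp.sum()

p2 = fit_pairwise_maxent(p_emp)
print(f"fraction of multi-information captured: {multi_information_fraction(p_emp, p2):.3f}")
```

For real four-letter words, `p_emp` would be a 26 x 26 x 26 x 26 array of normalized word counts, which is still small enough to hold in memory. The fraction printed for the random toy data says nothing about English; the 92% figure comes from the paper's corpus.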