The Distribution of Phoneme Frequencies across the World's Languages: Macroscopic and Microscopic Information-Theoretic Models

Fermín Moscoso del Prado Martín, Suchir Salhan

Abstract

We demonstrate that the frequency distribution of phonemes across languages can be explained at both macroscopic and microscopic levels. Macroscopically, phoneme rank-frequency distributions closely follow the order statistics of a symmetric Dirichlet distribution whose single concentration parameter scales systematically with phonemic inventory size, revealing a robust compensation effect whereby larger inventories exhibit lower relative entropy. Microscopically, a Maximum Entropy model incorporating constraints from articulatory, phonotactic, and lexical structure accurately predicts language-specific phoneme probabilities. Together, these findings provide a unified information-theoretic account of phoneme frequency structure.

Paper Structure

This paper contains 21 sections, 19 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Rank-frequency plots for the five phoneme frequency distributions discussed in Sigurd (1968), compared with the word frequency distribution in the Brown corpus. Note the logarithmic scales.
  • Figure 2: Geographic and genetic diversity of the languages included in the UDHR dataset.
  • Figure 3: (a) Rank-frequency plot for the American English phoneme frequency distribution (red dots and lines) discussed in Sigurd (1968), overlaid with the predicted rank-frequency distribution for a symmetric Dirichlet distribution with estimated concentration parameter $\hat{\alpha}$. The green lines are the predicted mean order statistics, the darker green shading denotes their standard deviations, and the lighter green shading denotes their 95% C.I. (b) Relationship between the size of the phonemic inventory (horizontal axis; note the logarithmic scale) and the estimated value of the concentration parameter $\hat{\alpha}$ (vertical axis) across the three datasets. Each point denotes the phonemic inventory for a language variety. The black line plots a doubly logarithmic linear regression (whose parameters are given in Equation \ref{eq:prediction}) and the shaded area is its 95% C.I. (c) Rank-frequency plot for the American English phoneme frequency distribution (red dots and lines) discussed in Sigurd (1968), overlaid with its rank-frequency distribution reconstructed from its inventory size using Equation \ref{eq:prediction}. The green lines are the predicted mean order statistics, the darker green shading denotes their standard deviations, and the lighter green shading denotes their 95% C.I.
  • Figure 4: (a) Distributions (kernel density estimates) of the estimated values of the maximum entropy Lagrange multipliers across the UDHR datasets. (b) Comparison between the observed (red) and maximum entropy guessed (green) probabilities of the phonemes in the Abkhaz UDHR dataset (the horizontal axis plots the ranks of the observed data). (c) Comparison between the maximum entropy guessed (horizontal axis) and observed (vertical axis) probabilities of the phonemes across all UDHR datasets. The solid line plots a non-linear regression (lowess) and the dashed line plots the identity (note logarithmic scales). (d) Comparison between the maximum entropy guessed (horizontal axis) and observed (vertical axis) relative entropies of the phoneme distributions across all UDHR datasets. The solid line plots a linear regression (the shading denotes its 95% C.I.) and the dashed line plots the identity.
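
The macroscopic model described in the abstract and in Figure 3 can be illustrated with a minimal Monte Carlo sketch: draw probability vectors from a symmetric Dirichlet distribution, sort each draw by rank, and summarise the resulting order statistics by their mean, standard deviation, and a 95% interval. This is not the authors' implementation. The function names, the sample count, and the example values (40 phonemes, concentration 1.0) are hypothetical, and "relative entropy" is assumed here to mean Shannon entropy divided by its maximum, the logarithm of the inventory size, which this excerpt does not confirm.

```python
import numpy as np


def dirichlet_rank_frequency(n_phonemes, alpha, n_samples=2000, seed=0):
    """Monte Carlo summary of the order statistics of a symmetric Dirichlet.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated phoneme probability vector (sums to 1).
    samples = rng.dirichlet(np.full(n_phonemes, alpha), size=n_samples)
    # Sort each row in descending order: column r holds the rank-(r+1) frequency.
    ranked = -np.sort(-samples, axis=1)
    mean = ranked.mean(axis=0)
    sd = ranked.std(axis=0)
    lo, hi = np.percentile(ranked, [2.5, 97.5], axis=0)
    return mean, sd, lo, hi


def relative_entropy(p):
    """Shannon entropy of p divided by log(len(p)).

    Assumption: this normalisation is what 'relative entropy' refers to.
    """
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # zero-probability entries contribute nothing to the entropy
    return float(-(nz * np.log(nz)).sum() / np.log(p.size))


# Illustrative values only: a 40-phoneme inventory with concentration 1.0.
mean, sd, lo, hi = dirichlet_rank_frequency(n_phonemes=40, alpha=1.0)
print(mean[:5])                 # expected frequencies of the five top-ranked phonemes
print(relative_entropy(mean))   # normalised entropy of the mean rank-frequency curve
```

Plotting `mean` against rank on logarithmic axes, with `mean ± sd` and the `lo`/`hi` band as shading, gives a curve of the kind the Figure 3 caption describes; an observed rank-frequency distribution could then be overlaid for comparison.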