Statistical field theory for dialectology

James Burridge

TL;DR

This work develops a mesoscopic statistical field theory for language evolution by coupling a stochastic state field, which describes variant frequencies, to a latent bias field whose path probability is governed by an Onsager–Machlup prior. The model includes spatial interaction, social conformity, mutation, and a data-driven lifting construction that regularizes the bias field over space and time. Inference on large-scale US dialect survey data yields evidence for surface-tension-like coarsening of dialect regions and an estimate of the bias field's half-life, which sets the horizon for near-term prediction and gives insight into the drivers of change. The approach draws close parallels with physical diffusion and interface dynamics, offering a principled framework for studying language change and a path toward a physics-inspired theory of linguistics, while highlighting the need for richer state spaces and connectivity to improve predictive power.

Abstract

Is it possible to develop a 'physics of language' which can explain the spatial, temporal and social patterns we see, and which can predict future change like we forecast the weather? Such a theory is likely to involve ideas from statistical physics. A substantial literature already applies these ideas to language. However, we lack a model which can match the spatio-temporal detail of historical changes at the level of individual linguistic features, and which offers a principled mechanism to predict the future. Here we present a statistical field theory for the evolution of linguistic variables which takes steps to fill this gap. Linguistic variant frequencies are represented as a stochastic state field with spatial interaction and social conformity, coupled to a latent bias field with an Onsager–Machlup action that reduces overfitting to data. We derive parameter inference procedures and demonstrate them using examples of large-scale dialect survey data from the twentieth-century United States. The bias field has a characteristic half-life, which determines the horizon over which linguistic change can be predicted. Inferred model parameters provide evidence for surface-tension-driven coarsening of dialect regions, with population-density gradients exerting systematic forces on interfaces.

Paper Structure

This paper contains 32 sections, 104 equations, 18 figures.

Figures (18)

  • Figure 1: Local logistic regression estimates of the fraction of speakers born in 1950 who use 'soda' to describe a carbonated beverage. Data from the Cambridge Online Survey of World Englishes [vau00].
  • Figure 2: Voronoi tessellation of the USA obtained by k-means clustering of citizens into $10^3$ demes based on their zip codes. Voronoi seeds are given by the cluster centroids. Coordinates use the Universal Transverse Mercator system EPSG:32615 (WGS 84 / UTM zone 15N). Blue dots show zip codes, with sizes proportional to the population at each code. Red dots show deme centroids.
  • Figure 3: The first six eigenvectors of $\underline{\bm{L}} = \underline{\bm{I}} - \underline{\bm{\Sigma}}$ with $\eta = 2000$ km. In this case the variance retained is $R^2(6) = 99.4\%$.
  • Figure 4: Fractions of speakers (for different birth years) who use the 'soda' variant, estimated using local logistic regression.
  • Figure 5: Fractions of speakers (for different birth years) who use the 'roly poly' variant, estimated using local logistic regression.
  • ...and 13 more figures
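
The deme construction described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that clustering citizens located at their zip codes is equivalent to population-weighted k-means on the zip-code coordinates, and it uses synthetic coordinates and populations in place of the real data. The function name `weighted_kmeans`, the arrays `zip_xy` and `zip_pop`, and the choice of 10 demes (rather than $10^3$) are hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
import numpy as np


def weighted_kmeans(points, weights, k, n_iter=50, seed=0):
    """Population-weighted Lloyd iterations (illustrative sketch only).

    points  : (n, 2) array of projected coordinates, e.g. UTM eastings/northings
    weights : (n,) array of positive populations
    Returns (centroids, labels), where labels[i] is the deme of point i.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids at k distinct points, sampled in proportion to population.
    idx = rng.choice(len(points), size=k, replace=False, p=weights / weights.sum())
    centroids = points[idx].astype(float)

    def assign(c):
        # Nearest-centroid assignment: each point falls in the Voronoi cell of its centroid.
        d2 = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        for j in range(k):
            mask = labels == j
            if mask.any():  # leave the centroid of an empty cluster unchanged
                centroids[j] = np.average(points[mask], axis=0, weights=weights[mask])
    # Final assignment so that labels match the returned centroids.
    return centroids, assign(centroids)


# Synthetic stand-in for zip-code data (hypothetical values, not the survey data).
rng = np.random.default_rng(1)
zip_xy = rng.uniform(0.0, 1000.0, size=(500, 2))            # coordinates in km
zip_pop = rng.integers(100, 10_000, size=500).astype(float)  # population per zip code

centroids, deme_of_zip = weighted_kmeans(zip_xy, zip_pop, k=10)
deme_pop = np.bincount(deme_of_zip, weights=zip_pop, minlength=10)
```

Because each zip code is assigned to its nearest centroid, the resulting partition is the Voronoi tessellation seeded by the centroids, restricted to the zip-code locations. `deme_pop` then gives the population of each deme.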