Statistical field theory for dialectology
James Burridge
TL;DR
This work develops a mesoscopic statistical field theory for language evolution by coupling a stochastic state field describing variant frequencies to a latent bias field whose path probability is governed by an Onsager–Machlup prior. The model includes spatial interactions, social conformity, mutation, and a data-driven lifting construction that regularizes the bias field over space and time. Inference on large-scale US dialect survey data yields evidence for surface-tension-like coarsening of dialect regions and an estimate of the bias field's half-life, which sets the horizon for near-term prediction and gives insight into the drivers of change. The approach draws strong parallels with physical diffusion and interface dynamics, offering a principled framework for studying language change and a path toward a physics-inspired theory of linguistics. It also highlights the need for richer state spaces and connectivity to improve predictive power.
Abstract
Is it possible to develop a 'physics of language' which can explain the spatial, temporal and social patterns we see, and which can predict future change as we forecast the weather? Such a theory is likely to involve ideas from statistical physics. A substantial literature already applies these ideas to language. However, we lack a model which can match the spatio-temporal detail of historical changes at the level of individual linguistic features, and which offers a principled mechanism to predict the future. Here we present a statistical field theory for the evolution of linguistic variables which takes steps to fill this gap. Linguistic variant frequencies are represented as a stochastic state field with spatial interaction and social conformity, coupled to a latent bias field with an Onsager–Machlup action that reduces overfitting to data. We derive parameter inference procedures and demonstrate them using examples of large-scale dialect survey data from the twentieth-century United States. The bias field has a characteristic half-life, which determines the horizon over which linguistic change can be predicted. Inferred model parameters provide evidence for surface-tension-driven coarsening of dialect regions, with population-density gradients exerting systematic forces on interfaces.
