Patterns of Persistence and Diffusibility across the World's Languages
Yiyi Chen, Johannes Bjerva
TL;DR
The paper investigates why cross-linguistic similarities arise in colexification and phonology by distinguishing genealogical persistence from diffusion via contact. It builds a large-scale language graph for 1,966 languages that integrates semantic and phonological distances, genealogical relatedness, and linguistic contact intensity, using distributional colexification data (ColexNet+) from Bible translations and multiple lexical resources. Through regression and correlation analyses, the study tests established hypotheses (e.g., the persistence of phonology vs. colexifications) and proposes new ones (differential persistence across core, emotion, and abstract vs. concrete concepts). The findings largely confirm that phonology is more persistent than colexifications and that colexifications show limited diffusion, while revealing nuanced patterns across concept types and relatedness levels. The resource and results offer a scalable framework for cross-linguistic research, with potential applications in transfer learning, multilingual NLP, and comparative linguistics. The authors acknowledge biases and domain-specific caveats arising from the Bible-based data.
Abstract
Language similarities can be caused by genetic relatedness, areal contact, universality, or chance. Colexification, a type of similarity in which a single lexical form conveys multiple meanings, is underexplored. In our work, we shed light on the linguistic causes of cross-lingual similarity in colexification and phonology by exploring genealogical stability (persistence) and contact-induced change (diffusibility). We construct large-scale graphs incorporating semantic, genealogical, phonological, and geographical data for 1,966 languages. We then show the potential of this resource by investigating several established hypotheses from previous work in linguistics, while proposing new ones. Our results strongly support one previously established hypothesis in the linguistic literature, while offering contradicting evidence to another. Our large-scale resource opens the way for further research across disciplines, e.g. in multilingual NLP and comparative linguistics.
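To make the notion of colexification concrete, the following is a minimal Python sketch of how colexification patterns could be extracted from word lists and compared across languages. Everything in it is an assumption for illustration: the three lexicons are invented toy data (not drawn from ColexNet+), the names `colexifications` and `jaccard_distance` are hypothetical, and Jaccard distance is a stand-in measure rather than the distance used in the paper.

```python
from itertools import combinations

# Toy, invented lexicons: word form -> set of concepts it expresses.
# Illustrative only; these entries are not taken from any real resource.
lexicons = {
    "lang_a": {"ma": {"HAND", "ARM"}, "ti": {"TREE", "WOOD"}, "lu": {"MOON"}},
    "lang_b": {"ka": {"HAND", "ARM"}, "po": {"TREE"}, "wo": {"WOOD"},
               "si": {"MOON", "MONTH"}},
    "lang_c": {"ro": {"HAND", "ARM"}, "du": {"TREE", "WOOD"},
               "mi": {"MOON", "MONTH"}},
}


def colexifications(lexicon):
    """Return the set of concept pairs that share a single word form."""
    pairs = set()
    for concepts in lexicon.values():
        # Sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(concepts), 2):
            pairs.add((a, b))
    return pairs


def jaccard_distance(x, y):
    """1 - |x & y| / |x | y|; defined as 0.0 when both sets are empty."""
    union = x | y
    if not union:
        return 0.0
    return 1.0 - len(x & y) / len(union)


# Colexification patterns per language.
patterns = {lang: colexifications(lex) for lang, lex in lexicons.items()}

# Weighted edges of a small language graph: one distance per language pair.
edges = {
    (l1, l2): jaccard_distance(patterns[l1], patterns[l2])
    for l1, l2 in combinations(sorted(patterns), 2)
}

for (l1, l2), dist in edges.items():
    print(f"{l1} - {l2}: {dist:.3f}")
```

In a graph like the one described above, each language pair would carry several such edge weights (semantic, phonological, genealogical, geographical), which could then be compared against one another in a regression or correlation analysis.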
