Patterns of Persistence and Diffusibility across the World's Languages

Yiyi Chen, Johannes Bjerva

TL;DR

The paper investigates why cross-linguistic similarities arise in colexification and phonology by distinguishing genealogical persistence from diffusion via contact. It builds a large-scale language graph for 1,966 languages, integrating semantic and phonological distances, genealogical relatedness, and linguistic contact intensity, using distributional colexification data (ColexNet+) from Bible translations and multiple lexical resources. Through regression and correlation analyses, the study tests established hypotheses (e.g., persistence of phonology vs. colexifications) and proposes new ones (differential persistence across core, emotion, and abstract vs. concrete concepts). The findings largely confirm that phonology is more persistent than colexifications and that colexifications show limited diffusion, while revealing nuanced patterns across concept types and relatedness levels. The resource and results offer a scalable framework for cross-linguistic research, with potential applications in transfer learning, multilingual NLP, and comparative linguistics, subject to acknowledged biases and domain-specific caveats due to the Bible-based data.

Abstract

Language similarities can be caused by genetic relatedness, areal contact, universality, or chance. Colexification, i.e., a type of similarity where a single lexical form is used to convey multiple meanings, is underexplored. In our work, we shed light on the linguistic causes of cross-lingual similarity in colexification and phonology by exploring genealogical stability (persistence) and contact-induced change (diffusibility). We construct large-scale graphs incorporating semantic, genealogical, phonological and geographical data for 1,966 languages. We then show the potential of this resource by investigating several established hypotheses from previous work in linguistics, while proposing new ones. Our results strongly support a previously established hypothesis in the linguistic literature, while offering contradicting evidence to another. Our large-scale resource opens up further research across disciplines, e.g., in multilingual NLP and comparative linguistics.

Paper Structure (28 sections, 14 figures, 4 tables)

Figures (14)

  • Figure 1: A visualization of the Hypotheses. Left Top: Hypothesis H.1 of low persistence and high diffusibility of colexification patterns compared to phonological patterns. Right Top: Hypothesis H2.a of differential persistence and diffusibility in colexification patterns in nuclear, non-nuclear and emotion vocabularies. Left Bottom: Hypothesis H2.b of high persistence and low diffusibility of abstract colexification patterns compared to concrete colexification patterns. Right Bottom: Hypothesis H2.c of high persistence and low diffusibility of affectively loaded abstract colexification patterns compared to affectively loaded concrete colexification patterns.
  • Figure 1: Average number of colexification patterns per concept in ColexNet+.
  • Figure 1: Language contact graph of a sub-graph of German. The colors represent neighbourhood between languages: center language (red), non-neighbouring languages (green) and neighbouring languages (blue). (Top) The numbers on the edges represent the geographical distances between each pair of languages. (Bottom) The numbers on the edges represent the contact languages in between each pair of languages.
  • Figure 1: Data distribution of (left) phonological distance and (right) colexification distance based on nuclear vocabulary, by level of genealogical relatedness.
  • Figure 1: Coefficients among variables for phon and colex. The 95% confidence intervals are minimal, and $p<0.001$ for all. Values that are not visible in the plots are close to 0.
  • ...and 9 more figures