Exploring language relations through syntactic distances and geographic proximity

Juan De Gregorio, Raúl Toral, David Sánchez

TL;DR

Pairwise distances between part-of-speech (POS) trigram distributions reveal definite clusters that correspond to well-known language families and groups. Language similarity also correlates significantly with geographic distance, which underscores the influence of spatial proximity on language kinships.

Abstract

Languages are grouped into families that share common linguistic traits. While this approach has been successful in understanding genetic relations between diverse languages, more analyses are needed to accurately quantify their relatedness, especially in less-studied linguistic levels such as syntax. Here, we explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that employing POS trigrams maximizes the possibility of capturing syntactic variations while remaining compatible with the amount of available data. Linguistic connections are then established by assessing pairwise distances based on the POS distributions. Intriguingly, our analysis reveals definite clusters that correspond to well-known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinships.
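The pipeline summarized in the abstract (POS trigram distributions, then pairwise distances between them) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy tag sequences, the function names `trigram_distribution` and `jensen_shannon_distance`, and the use of base-2 logarithms (which bounds the distance in [0, 1]) are assumptions made for illustration. Real input would be POS-tagged sentences from Universal Dependencies treebanks.

```python
from collections import Counter
from math import log2, sqrt


def trigram_distribution(sentences):
    """Relative frequencies of POS trigrams over a list of tagged sentences."""
    counts = Counter()
    for tags in sentences:
        # zip over three shifted views yields every consecutive trigram
        counts.update(zip(tags, tags[1:], tags[2:]))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no trigrams found; sentences need at least 3 tags")
    return {trigram: n / total for trigram, n in counts.items()}


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs, assumed)."""
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * log2(qk / mk)
    return sqrt(max(divergence, 0.0))


# Hypothetical toy corpora: two "languages" with different word orders.
lang_a = [["DET", "NOUN", "VERB", "DET", "NOUN"], ["PRON", "VERB", "DET", "NOUN"]]
lang_b = [["DET", "NOUN", "DET", "NOUN", "VERB"], ["PRON", "DET", "NOUN", "VERB"]]

dist_a = trigram_distribution(lang_a)
dist_b = trigram_distribution(lang_b)
print(jensen_shannon_distance(dist_a, dist_b))
```

Applying `jensen_shannon_distance` to every pair of languages gives a symmetric distance matrix, which is the kind of object that hierarchical clustering or a minimum spanning tree would then take as input.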

Paper Structure (22 sections, 26 equations, 12 figures)

Figures (12)

  • Figure 1: Estimated predictability gain when considering $(u+1)$th-order instead of $u$th-order transition probabilities in POS sequences of (a) German, (b) Icelandic, (c) Portuguese and (d) Czech, as extracted from the Universal Dependencies library.
  • Figure 2: Accuracy in language identification, determined by computing and comparing the probabilities of observing a given tagged sentence within each considered language, based on a $u$th-order Markov model.
  • Figure 3: Probability distribution of POS trigrams for (a) English and (b) Japanese.
  • Figure 4: Heatmap visualization of the Jensen-Shannon distance matrix, calculated from POS trigram distributions. Rows and columns are organized based on hierarchical clustering. The colour spectrum in the heatmap illustrates data matrix values.
  • Figure 5: Minimum spanning tree generated from the Jensen-Shannon distance matrix, with node colors representing clusters identified through $k$-medoids analysis. The shape of the nodes represents the language typology. Full lines are assigned to links between languages belonging to the same group; dashed lines for languages of the same family but different group; finally, dotted lines connect languages from distinct families.
  • ...and 7 more figures
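
The language-identification procedure described in the Figure 2 caption can be sketched as follows: fit a $u$th-order Markov model of POS tags per language, score a tagged sentence under each model, and pick the language with the highest probability. This is an illustrative sketch only. The additive smoothing parameter `alpha`, the `<s>` padding symbol, the toy training sentences, and all function names are assumptions, not details taken from the paper.

```python
from collections import Counter
from math import log


def train_markov(sentences, order, tagset, alpha=1.0):
    """Count (context -> next tag) transitions for an order-`order` model."""
    contexts, transitions = Counter(), Counter()
    for tags in sentences:
        padded = ["<s>"] * order + list(tags)  # assumed start-of-sentence padding
        for i in range(order, len(padded)):
            context = tuple(padded[i - order:i])
            contexts[context] += 1
            transitions[context, padded[i]] += 1
    return {"order": order, "contexts": contexts, "transitions": transitions,
            "vocab_size": len(tagset), "alpha": alpha}


def log_probability(model, tags):
    """Log-probability of a tag sequence, with additive smoothing (assumed)."""
    order, alpha, size = model["order"], model["alpha"], model["vocab_size"]
    padded = ["<s>"] * order + list(tags)
    total = 0.0
    for i in range(order, len(padded)):
        context = tuple(padded[i - order:i])
        numerator = model["transitions"][context, padded[i]] + alpha
        denominator = model["contexts"][context] + alpha * size
        total += log(numerator / denominator)
    return total


def identify(models, tags):
    """Name of the language whose model assigns the highest probability."""
    return max(models, key=lambda name: log_probability(models[name], tags))


# Hypothetical toy data: "A" places adjectives before nouns, "B" after.
TAGS = {"DET", "NOUN", "VERB", "ADJ", "PRON"}
models = {
    "A": train_markov([["DET", "ADJ", "NOUN", "VERB"],
                       ["PRON", "VERB", "DET", "ADJ", "NOUN"]], 2, TAGS),
    "B": train_markov([["DET", "NOUN", "ADJ", "VERB"],
                       ["PRON", "VERB", "DET", "NOUN", "ADJ"]], 2, TAGS),
}
print(identify(models, ["DET", "ADJ", "NOUN"]))
```

With `order=2`, each tag is conditioned on the two preceding tags, so the model's statistics are those of POS trigrams.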