A Graph Diffusion Algorithm for Lexical Similarity Evaluation

Karol Mikula, Mariana Sarkociová Remešíková

TL;DR

The paper proposes a graph diffusion framework to quantify lexical similarity across languages by embedding translations into cluster-probability vectors via a weighted directed graph with Dirichlet boundary conditions. Distances between translations are computed from phonetic transcriptions using a modified Damerau-Levenshtein metric and aggregated through a diffusion process on a three-layer graph that includes reference languages, hypothetical boundary languages, and classified languages; the final coordinates in $[0,1]^c$ represent probabilities of cluster membership. Key contributions include the handling of synonyms, an asymmetry that corrects for unequal cluster sizes, a tunable diffusion parameter $K$, and a parameterization strategy that yields interpretable similarity distributions; case studies on European languages demonstrate the method's ability to reveal realistic mutual influences and relationships in multilingual territories. The approach offers a scalable, data-agnostic way to analyze lexical similarity beyond simple word-for-word ratios, with potential applications in diachronic linguistics and multilingual studies.

Abstract

In this paper, we present an algorithm for evaluating lexical similarity between a given language and several reference language clusters. The input is a list of concepts and the corresponding translations in all considered languages. Moreover, each reference language is assigned to one of $c$ language clusters. For each of the concepts, the algorithm computes the distance between each pair of translations. Based on these distances, it constructs a weighted directed graph, where every vertex represents a language. It then solves a graph diffusion equation with a Dirichlet boundary condition, where the unknown is a map from the vertex set to $\mathbb{R}^c$. The resulting coordinates are values from the interval $[0,1]$ and they can be interpreted as probabilities of belonging to each of the clusters or as a lexical similarity distribution with respect to the reference clusters. The distances between translations are calculated using phonetic transcriptions and a modification of the Damerau-Levenshtein distance. The algorithm can be useful in analyzing relationships between languages spoken in multilingual territories with many mutual influences. We demonstrate this by presenting a case study regarding various European languages.
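
For orientation, the distance step mentioned in the abstract can be illustrated with the unmodified baseline. The sketch below implements the restricted Damerau-Levenshtein distance (also called optimal string alignment) with unit costs. It does not reproduce the authors' modification, which works on phonetic transcriptions; the function name `osa_distance` and the unit costs are assumptions made only for this illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Unit cost for insertion, deletion, substitution and transposition of two
    adjacent symbols. This is the textbook baseline, not the modified metric
    described in the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading symbols of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading symbols of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent symbols.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Orthographic strings are used here only for readability; the paper
# compares phonetic transcriptions.
print(osa_distance("flour", "flower"))  # 2: one substitution, one insertion
print(osa_distance("ca", "ac"))         # 1: a single transposition
```

A phonetically informed variant would presumably replace the constant substitution cost with a cost that depends on how similar the two sounds are, but the exact form of that cost is not given in this summary.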

Paper Structure

This paper contains 18 sections, 1 theorem, 23 equations, 17 figures, 3 tables.

Key Result

Proposition 2.1

Let $(\varphi_1,\dots,\varphi_{n+m})$ be the solution of the system (EqLinSystem) and let $\varphi_i^k$ be the $k$-th component of $\varphi_i$. Then we have

Figures (17)

  • Figure 1: The directed graph $G$ used in our model. Here, we have 5 reference languages $v_1,\dots, v_5$, one classified language $l_1$ and two hypothetical languages: $h_1$ for the cluster $\{v_1,v_2,v_3\}$ and $h_2$ for the cluster $\{v_4,v_5\}$. The two-sided arrows represent pairs of directed edges, where the diffusion coefficients are equal in both directions. On the pairs of edges connecting the two clusters, we have an asymmetric diffusion, since the clusters are of different sizes. There is a one-way diffusion between the classified language and all reference languages and also between each reference language and its corresponding hypothetical language.
  • Figure 2: Voiced, non-lateral, non-sibilant and non-coarticulated consonants represented in our model.
  • Figure 3: The vowel chart created by the International Phonetic Association (IPA). Most vowels are listed in pairs, where the one on the right represents the rounded version of the sound. In our model, the first two coordinates of a vowel are set so that the vowel 'ə' is placed at the origin and the coordinate axes intersect the trapezoid at $(-1,0)$, $(1,0)$, $(0,-1)$ and $(0,1)$.
  • Figure 4: The classification of the Scots word 'flour' for different values of the diffusion parameter $K$. The points 'Slavic', 'Romance' and 'Germanic' represent the hypothetical languages. On the left, we used the value $K=0.4$. We can observe that all words are, to some extent, mutually attracted (each has moved at least slightly away from its hypothetical language). For $K=0.6$ (middle), we can see that Germanic words are still slightly attracted to their Romance counterparts, which reflects their distant common origin. The only exception is the word 'flower', which exhibits quite a strong mutual attraction with its Romance cognates. In the third picture, we set $K=1.0$. This value is already a little too high, since the result only weakly reflects the relationships between the words in the experiment.
  • Figure 5: Classification of words that have phonetically similar counterparts in several reference clusters of varying size: the Estonian 'viil' (file). The basic model (EqLaplace)-(EqGraphLaplace) causes an unrealistic dominance of the Germanic cluster (left). The directed graph model (EqLaplaceFinal)-(EqGraphLaplaceFinal) evaluates the match with each cluster fairly and does not prioritize larger clusters (right).
  • ...and 12 more figures
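
The graph layout described in the Figure 1 caption can be turned into a small numerical sketch of a Dirichlet problem on a directed graph. Everything numeric below is an assumption made for illustration: the weights, the convention that `W[i, j]` is the coefficient with which vertex `j` influences vertex `i`, and the rule that scales cross-cluster weights by the size of the influencing cluster. None of these is taken from the paper, and the diffusion parameter $K$ is not modeled. Only the topology (five reference languages in two clusters, one classified language, two hypothetical boundary languages, and the one-way edges) follows the caption.

```python
import numpy as np

# Vertex order follows Figure 1.
names = ["v1", "v2", "v3", "v4", "v5", "l1", "h1", "h2"]
idx = {name: i for i, name in enumerate(names)}
clusters = {"h1": ["v1", "v2", "v3"], "h2": ["v4", "v5"]}
cluster_of = {v: h for h, members in clusters.items() for v in members}

# W[i, j]: hypothetical coefficient with which vertex j influences vertex i.
W = np.zeros((len(names), len(names)))

def influence(target: str, source: str, weight: float) -> None:
    W[idx[target], idx[source]] = weight

# Assumed similarity of the classified language l1 to each reference language.
l1_weights = {"v1": 0.9, "v2": 0.8, "v3": 0.7, "v4": 0.2, "v5": 0.1}

for a in cluster_of:
    for b in cluster_of:
        if a == b:
            continue
        if cluster_of[a] == cluster_of[b]:
            influence(a, b, 1.0)  # symmetric diffusion inside a cluster
        else:
            # Asymmetric diffusion between clusters: an illustrative
            # correction that divides by the size of the influencing cluster.
            influence(a, b, 0.3 / len(clusters[cluster_of[b]]))
    influence(a, cluster_of[a], 2.0)  # one-way: hypothetical -> reference
    influence("l1", a, l1_weights[a])  # one-way: reference -> classified

interior = [idx[n] for n in ["v1", "v2", "v3", "v4", "v5", "l1"]]
boundary = [idx[n] for n in ["h1", "h2"]]
g = np.eye(2)  # Dirichlet data: h1 -> (1, 0), h2 -> (0, 1)

# For every interior vertex i: sum_j W[i, j] * (phi_j - phi_i) = 0.
W_ii = W[np.ix_(interior, interior)]
W_ib = W[np.ix_(interior, boundary)]
A = np.diag(W_ii.sum(axis=1) + W_ib.sum(axis=1)) - W_ii
phi = np.linalg.solve(A, W_ib @ g)

for name, row in zip(["v1", "v2", "v3", "v4", "v5", "l1"], phi):
    print(name, np.round(row, 3))
```

With unit vectors as boundary values, each row of `phi` lies in $[0,1]^2$ and sums to one, so it can be read as a distribution over the two clusters. In this toy setup the cross-cluster scaling gives every reference language the same total incoming weight from the other cluster regardless of that cluster's size, which is one simple way to obtain the kind of size correction the caption describes.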

Theorems & Definitions (3)

  • Proposition 2.1
  • proof
  • Remark 2.2