A Graph Diffusion Algorithm for Lexical Similarity Evaluation
Karol Mikula, Mariana Sarkociová Remešíková
TL;DR
The paper proposes a graph-based diffusion framework that quantifies lexical similarity across languages by embedding languages into cluster-probability vectors via a weighted directed graph with Dirichlet boundary conditions. Distances between translations are computed from phonetic transcriptions using a modified Damerau-Levenshtein metric and aggregated through a diffusion process on a three-layer graph comprising reference languages, hypothetical boundary languages, and classified languages; the final coordinates in $[0,1]^c$ represent probabilities of cluster membership. Key contributions include the handling of synonyms, an asymmetry that corrects for unequal cluster sizes, a tunable diffusion parameter $K$, and a parameterization strategy that yields interpretable similarity distributions. Case studies on European languages demonstrate the method's ability to reveal realistic mutual influences and relationships in multilingual territories. The approach offers a scalable way to analyze lexical similarity beyond simple word-for-word ratios, with potential applications in diachronic linguistics and multilingual studies.
Abstract
In this paper, we present an algorithm for evaluating lexical similarity between a given language and several reference language clusters. The input is a list of concepts and the corresponding translations in all considered languages. Moreover, each reference language is assigned to one of $c$ language clusters. For each concept, the algorithm computes the distance between each pair of translations. Based on these distances, it constructs a weighted directed graph in which every vertex represents a language. It then solves a graph diffusion equation with a Dirichlet boundary condition, where the unknown is a map from the vertex set to $\mathbb{R}^c$. The resulting coordinates lie in the interval $[0,1]$ and can be interpreted as probabilities of belonging to each of the clusters, or as a lexical similarity distribution with respect to the reference clusters. The distances between translations are calculated using phonetic transcriptions and a modification of the Damerau-Levenshtein distance. The algorithm can be useful in analyzing relationships between languages spoken in multilingual territories with many mutual influences. We demonstrate this by presenting a case study of various European languages.
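To make the pipeline described above more concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that do not come from the paper: it uses the plain optimal-string-alignment variant of the Damerau-Levenshtein distance rather than the modified metric, a single concept with made-up toy transcriptions, an edge weight of the form $\exp(-K d)$ with an arbitrary $K$, no layer of hypothetical boundary languages, and a steady-state solution obtained by Gauss-Seidel relaxation. The names `osa_distance`, `diffuse`, `words`, and `fixed` are hypothetical.

```python
import math

def osa_distance(a, b):
    """Plain optimal-string-alignment distance (NOT the paper's modified metric)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[len(a)][len(b)]

def diffuse(weights, fixed, n_clusters, n_iter=500):
    """Relax u on the free vertices while the vertices in `fixed` keep their values.

    weights[i][j] is the weight of the directed edge i -> j (influence of j on i);
    fixed maps a reference vertex to its cluster index (the Dirichlet condition).
    """
    n = len(weights)
    u = [[1.0 / n_clusters] * n_clusters for _ in range(n)]
    for v, k in fixed.items():
        u[v] = [1.0 if c == k else 0.0 for c in range(n_clusters)]
    for _ in range(n_iter):
        for i in range(n):
            if i in fixed:
                continue
            total = sum(weights[i][j] for j in range(n) if j != i)
            # Each free vertex becomes the weighted average of its neighbours.
            u[i] = [sum(weights[i][j] * u[j][c] for j in range(n) if j != i) / total
                    for c in range(n_clusters)]
    return u

# Toy data (assumed): one concept, four reference languages, two classified ones.
words = ["voda", "woda", "vaser", "vater", "vada", "vade"]
fixed = {0: 0, 1: 0, 2: 1, 3: 1}   # reference vertex -> cluster index
K = 3.0                            # assumed sharpness parameter

n = len(words)
weights = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            dist = osa_distance(words[i], words[j]) / max(len(words[i]), len(words[j]))
            weights[i][j] = math.exp(-K * dist)

u = diffuse(weights, fixed, n_clusters=2)
for i in (4, 5):
    print(words[i], [round(x, 3) for x in u[i]])
```

Because each free vertex is repeatedly replaced by a convex combination of its neighbours' values, and the reference vertices hold one-hot vectors, every resulting coordinate stays in $[0,1]$ and the coordinates of each vertex sum to one, which is what allows the interpretation as a cluster-membership distribution.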
