Measuring cross-language intelligibility between Romance languages with computational tools
Liviu P Dinu, Ana Sabina Uban, Bogdan Iordache, Anca Dinu, Simona Georgescu
TL;DR
The paper proposes a novel computational framework for measuring cross-language intelligibility among Romance languages that integrates lexical overlap, surface-form similarity, and semantic similarity. It introduces the Lexical Intelligibility index $D_{LI}$, which combines surface similarity $S_L$ and semantic similarity $S_S$. The index is computed over cognate and borrowing pairs drawn from RoBoCoP and evaluated on the RomCro and EuroParl corpora, using both orthographic and phonetic forms and both static and contextual embeddings. The findings reveal asymmetric intelligibility patterns: phonetic information generally lowers the scores, and pairs involving Romanian show the largest asymmetries. On RomCro, the $D_{LI}$ scores correlate with human cloze test results ($\rho = 0.71$, $p = 0.0013$), which supports the metric's validity while highlighting its dependence on corpus and domain. The work provides a scalable benchmark for inherent intelligibility in closely related languages and emphasizes the influence of data resources and representation choices on cross-language understanding.
Abstract
We present an analysis of mutual intelligibility between related languages, applied to the Romance family. We introduce a novel computational metric that estimates intelligibility from lexical similarity, using the surface and semantic similarity of related words, and use it to measure mutual intelligibility among the five main Romance languages (French, Italian, Portuguese, Spanish, and Romanian). We compare results obtained with the orthographic and phonetic forms of words, with different parallel corpora, and with different vector models of word meaning. The resulting intelligibility scores confirm intuitions about asymmetric intelligibility across languages and correlate significantly with the results of cloze tests in human experiments.
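To make the idea of combining surface and semantic similarity concrete, the following is a minimal Python sketch, not the paper's definition of $D_{LI}$. Everything in it is an assumption made for illustration: surface similarity is taken to be normalized Levenshtein similarity, semantic similarity is taken to be cosine similarity between word vectors, the two are combined by a simple product, and pairs are averaged without weighting. The function names and the toy vectors are hypothetical. Note that this toy score is symmetric, so it does not reproduce the asymmetries reported above.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Assumed stand-in for S_L: 1 minus length-normalized edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Assumed stand-in for S_S: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_score(src: str, tgt: str, src_vec: list[float], tgt_vec: list[float]) -> float:
    """Hypothetical combination: product of surface and (clipped) semantic similarity."""
    semantic = max(0.0, cosine_similarity(src_vec, tgt_vec))
    return surface_similarity(src, tgt) * semantic


def lexical_score(pairs: list[tuple[str, str, list[float], list[float]]]) -> float:
    """Hypothetical aggregate: unweighted mean of pair scores over related word pairs."""
    scores = [pair_score(*pair) for pair in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy cognate pairs with made-up 2-dimensional vectors (not real embeddings).
    toy_pairs = [
        ("noapte", "notte", [0.9, 0.1], [0.8, 0.2]),
        ("floare", "fiore", [0.2, 0.7], [0.3, 0.6]),
    ]
    print(lexical_score(toy_pairs))
```

Swapping orthographic strings for phonetic transcriptions, or static vectors for contextual ones, only changes the inputs to `pair_score`, which is one way to compare the representation choices discussed in the abstract.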
