Measuring cross-language intelligibility between Romance languages with computational tools

Liviu P Dinu, Ana Sabina Uban, Bogdan Iordache, Anca Dinu, Simona Georgescu

TL;DR

The paper measures cross-language intelligibility among Romance languages with a novel computational framework that integrates lexical overlap, surface form similarity, and semantic meaning. It introduces the Lexical Intelligibility index $D_{LI}$, which combines surface similarity $S_L$ and semantic similarity $S_S$. The index is computed over cognate/borrowing pairs drawn from RoBoCoP and evaluated on the RomCro and EuroParl corpora, using both orthographic and phonetic forms and both static and contextual embeddings. The findings reveal asymmetric intelligibility patterns: phonetic information generally reduces scores, and pairs involving Romanian exhibit the largest asymmetries. The $D_{LI}$ scores correlate with human cloze test results for RomCro ($\rho = 0.71$, $p = 0.0013$), which supports the metric’s validity while highlighting its dependence on corpus and domain. The work provides a scalable benchmark for inherent intelligibility in closely related languages and emphasizes the influence of data resources and representation choices on cross-language understanding.

Abstract

We present an analysis of mutual intelligibility in related languages, applied to languages in the Romance family. We introduce a novel computational metric for estimating intelligibility based on lexical similarity, using the surface and semantic similarity of related words. We use it to measure mutual intelligibility for the five main Romance languages (French, Italian, Portuguese, Spanish, and Romanian), and compare results using both the orthographic and phonetic forms of words, as well as different parallel corpora and vectorial models of word meaning representation. The obtained intelligibility scores confirm intuitions related to intelligibility asymmetry across languages and significantly correlate with results of cloze tests in human experiments.
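
To make the idea of combining surface and semantic similarity over related word pairs concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's definition of $D_{LI}$: the summary above does not give the formula, so the normalized edit-distance surface score, the cosine semantic score, the product used to combine them, and the function names (`surface_similarity`, `cosine`, `toy_intelligibility`) are all hypothetical choices. The embedding vectors are made-up toy values.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """ASSUMPTION: surface similarity as 1 - normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return 0.0 if norm == 0 else dot / norm


def toy_intelligibility(pairs) -> float:
    """ASSUMPTION: average the product of surface and semantic similarity.

    `pairs` is a list of (word_a, word_b, vector_a, vector_b) tuples.
    Negative cosine values are clipped to 0 so each term stays in [0, 1].
    """
    scores = [
        surface_similarity(wa, wb) * max(0.0, cosine(va, vb))
        for wa, wb, va, vb in pairs
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy cognate pairs (Romanian, Italian) with invented 2-d "embeddings".
example_pairs = [
    ("noapte", "notte", [0.9, 0.1], [0.8, 0.2]),
    ("casa", "casa", [0.5, 0.5], [0.5, 0.5]),
]
print(toy_intelligibility(example_pairs))
```

Both component scores here are symmetric, so this sketch cannot reproduce the asymmetric intelligibility patterns reported above; it only shows how a surface score and a semantic score might be folded into one number per language pair.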

Paper Structure (17 sections, 14 equations, 7 figures, 2 tables)


Figures (7)

  • Figure 1: Lexical intelligibility scores based on the RomCro corpus using static embeddings.
  • Figure 2: $D_{LI}^s$ index (based on RomCro) and cloze test results, for each language pair $A-B$ ($A$ as speaker language and $B$ as listener language), colored according to speaker's language.
  • Figure 3: Embeddings coverage.
  • Figure 4: Distribution of $D_{LI}$ scores using static embeddings and orthographic vs phonetic surface similarity.
  • Figure 5: Distribution of $D_{LI}$ scores using orthographic similarity and static vs contextual embeddings for semantic similarity.
  • ...and 2 more figures