A Reproducibility Study on Quantifying Language Similarity: The Impact of Missing Values in the URIEL Knowledge Base

Hasti Toossi, Guo Qing Huai, Jinyu Liu, Eric Khiu, A. Seza Doğruöz, En-Shiun Annie Lee

TL;DR

This study critically assesses the reproducibility and reliability of URIEL's language similarity measurements, focusing on how missing values and ambiguities in distance calculations affect downstream use in multilingual NLP. By attempting to reproduce aggregated feature vectors and pre-computed distances, the authors find that union and average aggregations are reproducible while $k$NN aggregation remains unclear due to missing procedural details, with regularized angular distance ($2D_\theta$) and union/average vectors yielding the most reproducible distances. A substantial portion of URIEL's languages (approximately 31.24%) lack any feature information, and many feature vectors are partially or wholly missing, undermining the meaningfulness of many distance values, especially for low-resource languages. The literature review shows URIEL/lang2vec are widely cited across cross-lingual modelling, performance prediction, and translation tasks, but also highlights biases from missing-value predictions and calls for clearer handling of missing data and broader, more representative coverage. Overall, the work provides concrete recommendations to improve data quality and methodological transparency, which are essential for reliable multilingual linguistic analyses and fair applicability to low-resource languages.

Abstract

In the pursuit of supporting more languages around the world, tools that characterize properties of languages play a key role in expanding the existing multilingual NLP research. In this study, we focus on a widely used typological knowledge base, URIEL, which aggregates linguistic information into numeric vectors. Specifically, we delve into the soundness and reproducibility of the approach taken by URIEL in quantifying language similarity. Our analysis reveals URIEL's ambiguity in calculating language distances and in handling missing values. Moreover, we find that URIEL does not provide any information about typological features for 31% of the languages it represents, undermining the reliability of the database, particularly on low-resource languages. Our literature review suggests URIEL and lang2vec are used in papers on diverse NLP tasks, which motivates us to rigorously verify the database as the effectiveness of these works depends on the reliability of the information the tool provides.
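
To make the role of missing values concrete, the following is a minimal sketch, not URIEL's or lang2vec's actual code. It assumes a missing feature is represented as `None`, that union aggregation takes the maximum of the known values across sources, that average aggregation takes their mean, and that a distance is computed only over features known in both languages. The doubling of the angular distance (so that non-negative vectors map to [0, 1]) is likewise an assumption about the regularization. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Feature = Optional[float]  # None marks a missing value (an assumption of this sketch)


def union_aggregate(sources: Sequence[Sequence[Feature]]) -> List[Feature]:
    """Per feature: maximum over the sources that report it, None if none do."""
    out: List[Feature] = []
    for values in zip(*sources):
        known = [v for v in values if v is not None]
        out.append(max(known) if known else None)
    return out


def average_aggregate(sources: Sequence[Sequence[Feature]]) -> List[Feature]:
    """Per feature: mean over the sources that report it, None if none do."""
    out: List[Feature] = []
    for values in zip(*sources):
        known = [v for v in values if v is not None]
        out.append(sum(known) / len(known) if known else None)
    return out


def angular_distance(u: Sequence[Feature], v: Sequence[Feature]) -> Optional[float]:
    """2 * arccos(cosine similarity) / pi over features known in both vectors.

    Returns None when the distance is undefined, e.g. when one language
    has no known features at all.
    """
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    if norm_u == 0 or norm_v == 0:
        return None
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return 2 * math.acos(cos) / math.pi


def fraction_empty(vectors: Sequence[Sequence[Feature]]) -> float:
    """Share of languages whose feature vector is entirely missing."""
    empty = sum(1 for vec in vectors if all(x is None for x in vec))
    return empty / len(vectors)


# Two hypothetical sources describing the same language over four features.
source_a = [1.0, None, 0.0, None]
source_b = [None, None, 1.0, None]
union_vec = union_aggregate([source_a, source_b])
average_vec = average_aggregate([source_a, source_b])
```

Under these assumptions, a language whose vector is entirely `None` has no defined distance to any other language, which is the situation the abstract reports for 31% of URIEL's languages.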

Paper Structure (20 sections, 3 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: URIEL Feature Hierarchy and Data Sources.
  • Figure 2: Number of languages with non-empty union feature vectors.
  • Figure 3: Distribution of languages based on the number of non-missing features in the union vector for each category, excluding languages with empty feature vectors.
  • Figure 4: Distribution of the top 200 most spoken languages based on the number of non-missing features in the union vector for each category, excluding languages with empty feature vectors.
  • Figure 5: Number of languages with non-empty union feature vectors in all language families.