Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+

Mason Shipton, York Hay Ng, Aditya Khan, Phuong Hanh Hoang, Xiang Lu, A. Seza Doğruöz, En-Shiun Annie Lee

TL;DR

URIEL+ expands a multilingual knowledge base by adding script vectors for $7{,}488$ languages, integrating Glottolog to include $18{,}710$ additional languages, and extending lineage imputation to $26{,}449$ languages. These contributions reduce sparsity ($14\%$ for scripts), increase language coverage (up to $19{,}015$ languages, $1{,}007\%$), and improve imputation quality by up to $33\%$, with cross-lingual transfer gains up to $6\%$ in certain setups. Script distances provide largely orthogonal information to existing URIEL+ distances, as shown by Mantel tests, underscoring their complementary value for language similarity modeling. Overall, the updated URIEL+ offers more complete and inclusive coverage for low-resource languages, enabling more robust multilingual research and transfer scenarios.
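As a rough illustration of the kind of Mantel test mentioned above, the sketch below correlates the upper triangles of two language distance matrices and estimates significance by jointly permuting the rows and columns of one of them. The function name `mantel_test`, the choice of Pearson correlation, and the permutation count are assumptions for illustration; the paper's exact test configuration is not stated here.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def _upper_triangle(matrix: Matrix, order: Sequence[int]) -> List[float]:
    """Flatten the strict upper triangle after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mantel_test(d1: Matrix, d2: Matrix, permutations: int = 999,
                seed: int = 0) -> Tuple[float, float]:
    """Return (correlation, two-sided permutation p-value) for two
    symmetric distance matrices over the same set of languages."""
    n = len(d1)
    identity = list(range(n))
    x = _upper_triangle(d1, identity)
    observed = _pearson(x, _upper_triangle(d2, identity))

    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel the languages in d2 only
        if abs(_pearson(x, _upper_triangle(d2, order))) >= abs(observed):
            hits += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return observed, (hits + 1) / (permutations + 1)
```

Under this sketch, a correlation close to zero between a script distance matrix and an existing distance matrix would be consistent with the two carrying largely independent information.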

Abstract

The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity remains prevalent, in the form of missing feature types, incomplete language entries, and limited genealogical coverage. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, this paper extends URIEL+ with three contributions: introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These additions reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and improve imputation quality metrics by up to 33%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups. Our advances make URIEL+ more complete and inclusive for multilingual research.
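The idea of propagating features across genealogies can be sketched as follows: each missing value in a language's vector is filled with the value of its nearest ancestor that has one. This is a minimal illustration under assumed data structures (a `features` dictionary with `None` for missing values and a `parent` dictionary describing the tree); the names are hypothetical and do not reflect the URIEL+ API or the paper's exact procedure.

```python
from typing import Dict, List, Optional

Vector = List[Optional[float]]


def impute_from_lineage(features: Dict[str, Vector],
                        parent: Dict[str, Optional[str]]) -> Dict[str, Vector]:
    """Fill each missing (None) value with the nearest ancestor's known value."""
    imputed: Dict[str, Vector] = {}
    for lang, vec in features.items():
        filled = list(vec)
        seen = {lang}  # guards against accidental cycles in the parent map
        ancestor = parent.get(lang)
        while (ancestor is not None and ancestor not in seen
               and any(v is None for v in filled)):
            seen.add(ancestor)
            anc_vec = features.get(ancestor)
            if anc_vec is not None:
                # Keep values already known; borrow only where still missing.
                filled = [a if v is None else v for v, a in zip(filled, anc_vec)]
            ancestor = parent.get(ancestor)
        imputed[lang] = filled
    return imputed


def sparsity(features: Dict[str, Vector]) -> float:
    """Fraction of feature cells that are missing."""
    cells = [v for vec in features.values() for v in vec]
    return sum(v is None for v in cells) / len(cells)


# Toy genealogy: family -> branch -> child (all values are made up).
features = {
    "family": [1.0, 0.0, None],
    "branch": [None, 1.0, None],
    "child": [None, None, None],
}
parent = {"family": None, "branch": "family", "child": "branch"}

imputed = impute_from_lineage(features, parent)
```

In this toy example the child inherits its second feature from the branch and its first from the family, while the third stays missing because no ancestor records it; comparing `sparsity(features)` with `sparsity(imputed)` gives the kind of sparsity reduction the abstract reports.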

Paper Structure

This paper contains 28 sections, 1 equation, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of our additions for mitigating data sparsity in URIEL+. (1) Density map of the $7{,}488$ languages with integrated script vectors. (2) Density map of the $18{,}710$ Glottolog child languages added to URIEL+. (3) Illustration of expanded lineage imputation, where missing values in child languages are filled from their ancestors.
  • Figure 2: Number of languages with available syntactic (syn), phonological (pho), inventory (inv), morphological (mor), and script (scr) data in URIEL+ before and after expanded lineage imputation. All linguistic sources in URIEL+, as well as Glottolog languages, are integrated. Panels show, from left to right, high-resource (HRL), medium-resource (MRL), and low-resource (LRL) languages (Joshi et al., 2020).
  • Figure 3: Sparsity of language vectors by type (syntactic, phonological, inventory, morphological, script) and by resource level, before expanded lineage imputation.
  • Figure 4: Percentage decrease in sparsity of language vectors by type (syntactic, phonological, inventory, morphological, script) and by resource level, after expanded lineage imputation.