URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base

Aditya Khan; Mason Shipton; David Anugraha; Kaiyao Duan; Phuong H. Hoang; Eric Khiu; A. Seza Doğruöz; En-Shiun Annie Lee

TL;DR

URIEL+ extends the URIEL knowledge base by integrating five additional databases, expanding typological feature coverage for 2898 languages. It introduces robust, customizable distance calculations with confidence scores and adds automatic imputation methods to handle missing data. It replaces pre-computed distances with dynamic queries, enabling up-to-date, configurable similarity measures, and provides explicit metrics for data completeness, consistency, and imputation quality. Empirical validation shows increased feature coverage, improved imputation quality, and better or comparable performance on downstream NLP tasks (LangRank, LinguAlchemy, ProxyLM). A case study indicates closer alignment with linguistic typological distance measures. Collectively, URIEL+ advances linguistic inclusion of low-resource languages and enhances usability for multilingual NLP research and applications, and its open-source release invites community contributions.

Abstract

URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
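To make the idea of a distance between feature vectors with missing values concrete, here is a minimal Python sketch. It is not the URIEL+ or lang2vec API: the function name `masked_cosine_distance`, the use of `None` for missing values, the choice of cosine distance, and the "coverage" ratio (a crude stand-in for a confidence score) are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence, Tuple

Feature = Optional[float]  # None marks a missing typological value (assumed encoding)


def masked_cosine_distance(
    u: Sequence[Feature], v: Sequence[Feature]
) -> Tuple[Optional[float], float]:
    """Cosine distance over the features observed in both vectors.

    Returns (distance, coverage). Coverage is the share of features present
    in both vectors; distance is None when it is undefined (no shared
    features, or an all-zero vector on the shared features).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    # Keep only positions where both languages have a recorded value.
    shared = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    coverage = len(shared) / len(u) if u else 0.0
    dot = sum(a * b for a, b in shared)
    norm_u = math.sqrt(sum(a * a for a, _ in shared))
    norm_v = math.sqrt(sum(b * b for _, b in shared))
    if norm_u == 0.0 or norm_v == 0.0:
        return None, coverage
    return 1.0 - dot / (norm_u * norm_v), coverage


# Hypothetical binary feature vectors for two languages.
lang_a = [1.0, 0.0, 1.0, None, 1.0]
lang_b = [1.0, 1.0, None, 0.0, 1.0]
distance, coverage = masked_cosine_distance(lang_a, lang_b)
```

In this toy example only three of the five features are observed in both vectors, so the coverage is 0.6. A low coverage signals that the distance rests on little shared evidence, which is the kind of situation that confidence scores and imputation are meant to address.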

Paper Structure (46 sections, 4 equations, 2 figures, 13 tables)


Introduction
Syntactic Distance
Phonological Distance
Phonemic Inventory Distance
Feature Coverage Expansion
Data Integrity and Imputation
Robust Distance Calculations with Confidence Scores
From URIEL to URIEL+
Integrating New Databases
Binarization for Non-Binary Features
Combining Redundant Features
Classifying and Renaming Features
Incorporating Glottocode Identifiers
Summary of Implementation Details
Automatic Imputation Algorithms
...and 31 more sections

Figures (2)

Figure 2: Number of languages with available syntactic (syn), phonological (pho), inventory (inv), and morphological (mor) data in URIEL and URIEL+ with all five databases.
Figure 3: Number of languages with available syntactic (syn), phonological (pho), inventory (inv), and morphological (mor) data in URIEL and URIEL+ with all five databases, shown from left to right for high-resource languages (HRLs), medium-resource languages (MRLs), and low-resource languages (LRLs) (Joshi et al., 2020).
