data2lang2vec: Data Driven Typological Features Completion

Hamidreza Amirzadeh, Sadegh Jafari, Anika Harju, Rob van der Goot

TL;DR

The paper introduces a multi-lingual Part-of-Speech (POS) tagger that achieves over 70% accuracy across 1,749 languages and uses textual data to predict missing typological features. It also introduces a more realistic evaluation setup that focuses on typology features likely to be missing, and the approach outperforms previous work in both setups.

Abstract

Language typology databases enhance multi-lingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely-used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage either predicts missing values based on features from other languages or focuses on single features; we propose to use textual data for better-informed feature prediction. To this end, we introduce a multi-lingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup that focuses on typology features likely to be missing, and show that our approach outperforms previous work in both setups.
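
To make the coverage figure concrete, the following is a minimal sketch of how coverage and per-feature missing ratios could be computed over a language-by-feature matrix. The matrix, the feature names, and the use of `None` for a missing value are all hypothetical and chosen for illustration; this is not the authors' code and does not use the lang2vec API.

```python
from typing import Optional

# Hypothetical toy matrix: rows are languages, columns are binary typology
# features. None marks a missing value. All names and values are invented.
FEATURES = ["feat_a", "feat_b", "feat_c"]
MATRIX: dict[str, list[Optional[int]]] = {
    "lang_1": [1, None, None],
    "lang_2": [0, 1, None],
    "lang_3": [None, None, None],
    "lang_4": [1, 0, None],
}


def coverage(matrix: dict[str, list[Optional[int]]]) -> float:
    """Fraction of (language, feature) cells that hold a known value."""
    cells = [value for row in matrix.values() for value in row]
    return sum(value is not None for value in cells) / len(cells)


def missing_ratio_per_feature(
    matrix: dict[str, list[Optional[int]]], features: list[str]
) -> dict[str, float]:
    """For each feature, the fraction of languages lacking a value for it."""
    n_languages = len(matrix)
    return {
        name: sum(row[i] is None for row in matrix.values()) / n_languages
        for i, name in enumerate(features)
    }


if __name__ == "__main__":
    print(f"coverage: {coverage(MATRIX):.3f}")
    for name, ratio in missing_ratio_per_feature(MATRIX, FEATURES).items():
        print(f"{name}: missing ratio {ratio:.2f}")
```

Under these assumptions, features whose missing ratio exceeds a chosen threshold (0.5 in the paper's Figure 1 caption) would be the natural targets for an evaluation focused on values that are likely to be missing.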

Paper Structure (21 sections, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Missing ratio distribution of our target features. A higher bar indicates a feature whose value is more likely to be missing, and a lower bar one that is less likely to be missing. Features with a missing ratio above 0.5 are listed in Table \ref{tab:second_eval}.
  • Figure 2: Optimization history for 10 trials, with the F1-score of the GBC classifier as the objective value.
  • Figure 3: Optimization history for 10 trials in the GBC classifier for the aes_status, feat_id, and feat_name features.
  • Figure 4: Optimization history for 10 trials in the GBC classifier for the geo_lat, geo_long, and lang_fam features.
  • Figure 5: Optimization history for 10 trials in the GBC classifier for the lang_group, lang_id, and learning_rate features.
  • ...and 6 more figures