Targeted Multilingual Adaptation for Low-resource Language Families

C. M. Downey, Terra Blevins, Dhwani Serai, Dwija Parikh, Shane Steinert-Threlkeld

TL;DR

This work investigates targeted multilingual adaptation of a pre-trained cross-lingual model to a language family, using the Uralic languages as a case study. By combining Language-Adaptive Pre-Training (LAPT) with vocabulary specialization tailored to the family, it demonstrates that adapting XLM-R to 15 Uralic languages yields substantial gains over mono- and multilingual baselines across POS tagging and dependency parsing. A large-scale regression analysis identifies that both the number of LAPT steps and the size of the specialized vocabulary positively affect performance, while a low value of the language-sampling parameter $\alpha$ benefits low-resource languages significantly, with limited impact on high-resource ones; notably, small specialized vocabularies (e.g., 16k) can outperform the original 250k cross-lingual vocabulary. The paper also provides an error analysis for Skolt Sámi and distills practical best practices: prioritize multilingual adaptation over single-language tricks, prefer vocabulary specialization for efficiency, and use a low $\alpha$ to up-sample low-resource languages. The findings offer actionable guidance for extending pre-trained models to under-resourced languages and are accompanied by open-source adaptation code and models.

Abstract

The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for low-resource languages, and that low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages. These results introduce new best practices for performing language adaptation in a targeted setting.

Paper Structure (37 sections, 6 equations, 5 figures, 15 tables)


Figures (5)

  • Figure 1: Uralic data composition by number of lines, on a log scale. The actual data quantities are shown with bars, while sampling distributions with several values of the $\alpha$ parameter are plotted as lines
  • Figure 2: Few-shot UAS: effect of hyper-parameters by language, marginalized across other parameter settings
  • Figure 3: Zero-shot UAS: effect of hyper-parameters by language, marginalized across other parameter settings
  • Figure 4: Few-shot POS: effect of hyper-parameters by language, marginalized across other parameter settings
  • Figure 5: Zero-shot POS: effect of hyper-parameters by language, marginalized across other parameter settings
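
The sampling distributions mentioned in the Figure 1 caption can be illustrated with a minimal sketch. This page does not state the exact formula, so the code assumes the common exponentiated-sampling convention in which language i is drawn with probability proportional to its line count raised to the power $\alpha$. The function name `sampling_distribution` and the line counts are hypothetical and are not taken from the paper.

```python
def sampling_distribution(line_counts, alpha):
    """Return per-language sampling probabilities.

    Assumption (not confirmed by the source): p_i is proportional to
    n_i ** alpha, where n_i is the number of lines for language i.
    alpha = 1 reproduces the raw data proportions; alpha = 0 is uniform.
    """
    if not line_counts:
        raise ValueError("line_counts must not be empty")
    weights = {lang: count ** alpha for lang, count in line_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical line counts, chosen only for illustration (not the paper's data).
example_counts = {
    "high_resource": 1_000_000,
    "mid_resource": 10_000,
    "low_resource": 100,
}

# Lower alpha shifts probability mass toward the smaller languages.
for alpha in (1.0, 0.5, 0.1):
    probs = sampling_distribution(example_counts, alpha)
    print(alpha, {lang: round(p, 4) for lang, p in probs.items()})
```

Under this assumed convention, lowering $\alpha$ flattens the distribution, which is the sense in which a low $\alpha$ up-samples low-resource languages.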