Targeted Multilingual Adaptation for Low-resource Language Families
C. M. Downey, Terra Blevins, Dhwani Serai, Dwija Parikh, Shane Steinert-Threlkeld
TL;DR
This work investigates targeted multilingual adaptation of a pre-trained cross-lingual model to a language family, using the Uralic languages as a case study. By combining Language-Adaptive Pre-Training (LAPT) with vocabulary specialization tailored to the family, it demonstrates that adapting XLM-R to 15 Uralic languages yields substantial gains over mono- and multilingual baselines on POS tagging and dependency parsing. A large-scale regression analysis finds that both the number of LAPT steps and the size of the specialized vocabulary positively affect performance, while a low value of the language-sampling parameter $\alpha$ benefits low-resource languages significantly, with limited impact on high-resource ones; notably, small specialized vocabularies (e.g., 16k) can outperform the original 250k cross-lingual vocabulary. The paper also provides an error analysis for Skolt Sámi and distills practical best practices: prioritize multilingual adaptation over single-language adaptation, prefer vocabulary specialization for efficiency, and use a low $\alpha$ to up-sample low-resource languages. The findings offer actionable guidance for extending pre-trained models to under-resourced languages and are accompanied by open-source adaptation code and models.
Abstract
The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for low-resource languages, and that low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages. These results introduce new best practices for performing language adaptation in a targeted setting.
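To make the up-sampling idea concrete, the sketch below shows one common way such sampling is implemented in multilingual pre-training: each language is drawn with probability proportional to its share of the data raised to an exponent (written here as alpha). This is an illustration under stated assumptions, not necessarily the exact scheme used in this work; the function name `sampling_probabilities` and the token counts are hypothetical.

```python
def sampling_probabilities(token_counts, alpha):
    """Exponentiated sampling: p_i is proportional to (n_i / N) ** alpha.

    alpha = 1.0 reproduces the raw data proportions;
    alpha = 0.0 samples every language uniformly;
    values in between up-sample the smaller languages.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    total = sum(token_counts.values())
    # Raise each language's data share to the power alpha.
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    # Renormalize so the probabilities sum to one.
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical token counts, chosen only for illustration:
# two larger languages and one very small one.
counts = {"fi": 1_000_000, "et": 500_000, "sms": 1_000}

for alpha in (1.0, 0.3, 0.0):
    probs = sampling_probabilities(counts, alpha)
    summary = ", ".join(f"{lang}={p:.4f}" for lang, p in probs.items())
    print(f"alpha={alpha}: {summary}")
```

Lowering alpha moves probability mass from the largest languages to the smallest one, which is the sense in which a low value "up-samples" low-resource languages.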
