Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord

TL;DR

This study investigates cost-efficient encoder development for closely related South Slavic languages by comparing from-scratch training to additional pretraining of multilingual models (XLM-R). Using an extensive HBS data collection and a Slovenian-related corpus, the authors pretrain both base and large XLM-R models and evaluate on NER, sentiment, and COPA tasks. Key findings show that additional pretraining, especially for large models, yields competitive performance with limited computation, while including Slovenian generally incurs minimal or no loss and can enhance cross-lingual transfer. A notable drift phenomenon is observed on some tasks with extensive pretraining, suggesting a balance between leveraging multilingual knowledge and specialization. The authors release the XL-BERTić and XL-SloBERTić models and the large HBS data to support further work in encoder development for less-resourced languages.

Abstract

The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary usage being in enriching large collections of data with metadata necessary for downstream research. We investigate the best way to ensure the existence of such encoder models for a set of very closely related languages (Croatian, Serbian, Bosnian and Montenegrin) by setting up a diverse benchmark for these languages and comparing the trained-from-scratch models with new models constructed via additional pretraining of existing multilingual models. We show that performance comparable to that of dedicated from-scratch models can be obtained by additionally pretraining available multilingual models, even with a limited amount of computation. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.

Paper Structure (52 sections, 1 figure, 12 tables)


Figures (1)

  • Figure 1: Performance of models on different tasks in relation to the round of additional pretraining. $r=0$ refers to round 0, before any additional pretraining, and thus represents the performance of the XLM-RoBERTa-base and XLM-RoBERTa-large models. The subsequent 8 data points represent stages of additional pretraining. One round equals 12k steps for the base model (XB-BERTić) and 6k steps for the large models (XL-BERTić and XL-SloBERTić); this ensures an identical amount of computation per round regardless of model size. The performance of cseBERT and BERTić is depicted with a black dashed line.
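
The round-to-step arithmetic in the Figure 1 caption can be made concrete with a small sketch. The step counts per round (12k for base, 6k for large) and the 8 rounds come from the caption; the names `STEPS_PER_ROUND`, `cumulative_steps`, and `schedule` are hypothetical helpers introduced here for illustration and are not part of the authors' code.

```python
# Steps per round, as stated in the Figure 1 caption.
# The dictionary keys are illustrative labels, not official model identifiers.
STEPS_PER_ROUND = {"base": 12_000, "large": 6_000}

# Round 0 (no additional pretraining) plus the 8 subsequent rounds in the caption.
NUM_ROUNDS = 8


def cumulative_steps(model_size: str, round_index: int) -> int:
    """Return the total additional-pretraining steps completed after `round_index` rounds."""
    if round_index < 0:
        raise ValueError("round_index must be non-negative")
    return STEPS_PER_ROUND[model_size] * round_index


# Cumulative steps at each data point r = 0..8 for both model sizes.
schedule = {
    size: [cumulative_steps(size, r) for r in range(NUM_ROUNDS + 1)]
    for size in STEPS_PER_ROUND
}

for size, steps in schedule.items():
    print(size, steps)
```

Under these numbers the base model reaches 96k steps and the large models 48k steps at the final round. Because the caption states that each round costs the same amount of computation, the 2:1 step ratio suggests that one large-model step was budgeted at roughly twice the cost of one base-model step; that ratio is read off the caption and is not measured here.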