On Multilingual Encoder Language Model Compression for Low-Resource Languages

Daniil Gurgurov, Michal Gregor, Josef van Genabith, Simon Ostermann

TL;DR

This work tackles the challenge of deploying multilingual encoder-only language models for low-resource languages by proposing a four-stage compression pipeline that combines knowledge distillation, structured pruning, hidden-size truncation, and vocabulary trimming. By distilling into fewer layers, pruning FFN width, truncating hidden dimensions, and trimming the tokenizer vocabulary, the approach achieves up to 92% parameter reduction with average performance drops of about 2–10% for moderate compression and 8–13% at maximum compression across four tasks. Key findings show that using a language-adapted teacher and appropriate initialization markedly improves distillation outcomes, and that the impact of compression depends on language data availability and task type (POS most robust, NER most sensitive, especially for Maltese). The results demonstrate practical, environmentally conscious pathways to deploy efficient, language-specific encoders for low-resource languages, with implications for broader accessibility and deployment of NLP tools. The authors also provide ablations identifying best practices and report that a substantial portion of performance loss can be mitigated through careful design choices.
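As a rough illustration of the vocabulary-trimming stage, the sketch below keeps only the special tokens and the tokens that actually occur in a target-language corpus, then re-indexes the embedding rows to match. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `trim_vocabulary`, the `min_count` threshold, and the toy vocabulary are all hypothetical.

```python
from collections import Counter


def trim_vocabulary(vocab, embeddings, corpus_ids, special_tokens, min_count=1):
    """Keep special tokens plus tokens seen at least `min_count` times.

    vocab: dict mapping token -> old id
    embeddings: list of embedding rows, indexed by old id
    corpus_ids: iterable of token-id sequences from the target-language corpus
    Returns (new_vocab, new_embeddings, old_to_new).
    """
    counts = Counter(tok_id for seq in corpus_ids for tok_id in seq)
    id_to_token = {old_id: token for token, old_id in vocab.items()}
    # Preserve the original id order so the mapping is deterministic.
    keep = [
        old_id
        for old_id in sorted(id_to_token)
        if id_to_token[old_id] in special_tokens or counts[old_id] >= min_count
    ]
    old_to_new = {old_id: new_id for new_id, old_id in enumerate(keep)}
    new_vocab = {id_to_token[old_id]: new_id for old_id, new_id in old_to_new.items()}
    new_embeddings = [embeddings[old_id] for old_id in keep]
    return new_vocab, new_embeddings, old_to_new


# Toy data (hypothetical): a six-token vocabulary with 2-dimensional embeddings.
vocab = {"[PAD]": 0, "[UNK]": 1, "il": 2, "the": 3, "kelb": 4, "dog": 5}
embeddings = [[float(i), float(i)] for i in range(len(vocab))]
corpus_ids = [[2, 4], [2]]  # only "il" and "kelb" occur in the target corpus

new_vocab, new_embeddings, old_to_new = trim_vocabulary(
    vocab, embeddings, corpus_ids, special_tokens={"[PAD]", "[UNK]"}
)
print(new_vocab)
print(len(embeddings), "->", len(new_embeddings), "embedding rows")
```

In a real model the same `old_to_new` mapping would also have to be applied to the tokenizer and to any output layer tied to the input embeddings.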

Abstract

In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to achieve extreme compression of multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% while maintaining competitive performance, with average drops of 2–10% for moderate compression and 8–13% at maximum compression on four downstream tasks (sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging) across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct ablation studies to identify the best practices for multilingual model compression using these techniques.
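To make the effect of each size reduction concrete, the following sketch counts the parameters of a BERT-style encoder as a function of vocabulary size, hidden size, layer count, and feed-forward width. The teacher shape is the standard mBERT-base configuration (vocabulary 119,547, hidden size 768, 12 layers, FFN width 3,072). The student shape is purely illustrative and is not a configuration reported in the paper, and the helper name `encoder_params` is hypothetical.

```python
def encoder_params(vocab_size, hidden, layers, ffn, max_pos=512, type_vocab=2):
    """Parameter count of a BERT-style encoder (embeddings, layers, pooler).

    The masked-language-modelling head is not counted.
    """
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and segment embeddings plus the embedding LayerNorm.
    embeddings = (vocab_size + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Up-projection and down-projection of the feed-forward block, with biases.
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# mBERT-base shape.
teacher = encoder_params(vocab_size=119_547, hidden=768, layers=12, ffn=3072)

# Illustrative student shape (assumption, not taken from the paper):
# fewer layers, narrower FFN, smaller hidden size, trimmed vocabulary.
student = encoder_params(vocab_size=40_000, hidden=312, layers=6, ffn=2048)

reduction = 1 - student / teacher
print(f"teacher: {teacher:,} parameters")  # teacher: 177,853,440 parameters
print(f"student: {student:,} parameters")
print(f"reduction: {reduction:.1%}")
```

Because the embedding matrix scales with `vocab_size * hidden`, trimming the vocabulary and shrinking the hidden size together remove a large share of a multilingual model's parameters, which is why those stages matter so much for models with very large vocabularies.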

Paper Structure

This paper contains 27 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Overview of our multilingual model compression methodology. We use (1) knowledge distillation to reduce layers, (2) structured pruning to eliminate redundant feed-forward network width, and (3) hidden size reduction and another round of knowledge distillation from the previous student model. Finally, (4) vocabulary trimming is applied to retain language-specific tokens.
  • Figure 2: First-step KD validation accuracies for mBERT and XLM-R with models initialized using the last $k$ layers. mBERT- and XLM-R-mt, sk, sw refer to models adapted to the target language; distilled denotes models trained with distillation loss, while student refers to identically trained models without distillation loss. The best accuracy is in all cases achieved when distilling from a target-language adapted model.
  • Figure 3: MSE vs. KD validation accuracy for mBERT with the models initialized using the last $k$ layers.
  • Figure 4: MSE vs. KD validation accuracy for XLM-R with the models initialized using the last $k$ layers.
  • Figure 5: Validation accuracy for various initialization strategies for mBERT.
  • ...and 5 more figures
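
The captions above mention a distillation loss and a comparison of "MSE vs. KD". This excerpt does not state the exact objectives, so the sketch below only shows two common options under that assumption: a temperature-scaled soft-target loss (KL divergence between teacher and student distributions) and a mean-squared error on the raw logits. The function names and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


def mse_loss(student_logits, teacher_logits):
    """Mean-squared error between student and teacher logits."""
    diffs = [(s - t) ** 2 for s, t in zip(student_logits, teacher_logits)]
    return sum(diffs) / len(diffs)


# Toy logits for a single masked position over a 3-token vocabulary.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.5]

print("soft-target KD loss:", kd_loss(student_logits, teacher_logits))
print("MSE loss:", mse_loss(student_logits, teacher_logits))
```

In practice either term would be averaged over all masked positions in a batch and possibly combined with the ordinary masked-language-modelling loss.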