Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model

Yeskendir Koishekenov, Alexandre Berard, Vassilina Nikoulina

TL;DR

This work targets memory-efficient deployment of the 54.5B-parameter NLLB-200 multilingual MoE MT model by pruning experts at inference time, without fine-tuning. It introduces expert-pruning metrics derived from gate statistics, compares pruning at three granularities (global, language-pair, and language-specific), and shows that up to 80% of experts can be removed with negligible loss in translation quality. The authors find that language-specific and language-pair experts emerge in NLLB-200, and that decoder experts tend to cluster by linguistically related languages. Language-aware pruning (particularly per-language granularity with a 3:1 encoder-to-decoder ratio) matches or closely approaches the full MoE model while reducing memory use and speeding up decoding, which makes single-GPU inference practical and massively multilingual MT easier to deploy. The authors also release pruning statistics and expert IDs so that others can reproduce memory-efficient use of NLLB-200 on a single GPU.

Abstract

The recently released NLLB-200 is a set of multilingual Neural Machine Translation models that cover 202 languages. The largest model is based on a Mixture of Experts architecture and achieves SoTA results across many language pairs. It contains 54.5B parameters and requires at least four 32GB GPUs just for inference. In this work, we propose a pruning method that enables the removal of up to 80% of experts without further finetuning and with a negligible loss in translation quality, which makes it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics can identify language-specific experts.
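
The memory claim in the abstract can be checked with rough arithmetic. The sketch below counts weight storage only and rests on two assumptions that are not stated in this section: weights are stored in 16 bits (2 bytes per parameter), and experts make up about 95% of the parameters (`EXPERT_FRACTION` is an illustrative placeholder, not a figure from the paper). Activations and decoding buffers are ignored, so real requirements would be somewhat higher.

```python
import math

BYTES_PER_PARAM = 2      # assumption: 16-bit weights
TOTAL_PARAMS = 54.5e9    # model size stated in the abstract
EXPERT_FRACTION = 0.95   # hypothetical share of parameters held by experts
PRUNED_SHARE = 0.80      # share of experts removed, as in the abstract


def weight_memory_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def params_after_pruning(n_params, expert_fraction, pruned_share):
    """Parameters left when a share of the expert parameters is dropped."""
    expert_params = n_params * expert_fraction
    dense_params = n_params - expert_params
    return dense_params + expert_params * (1.0 - pruned_share)


def gpus_needed(memory_gb, gpu_gb=32):
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(memory_gb / gpu_gb)


full_gb = weight_memory_gb(TOTAL_PARAMS)
pruned_gb = weight_memory_gb(
    params_after_pruning(TOTAL_PARAMS, EXPERT_FRACTION, PRUNED_SHARE)
)

print(f"full model:   {full_gb:.1f} GB on {gpus_needed(full_gb)} x 32GB GPU(s)")
print(f"pruned model: {pruned_gb:.1f} GB on {gpus_needed(pruned_gb)} x 32GB GPU(s)")
```

Under these assumptions the full model needs four 32GB GPUs for its weights and the pruned model fits on one, which is consistent with the abstract. The result is sensitive to `EXPERT_FRACTION`, so it should be replaced with the real expert parameter count before drawing conclusions.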

Paper Structure (28 sections, 5 equations, 9 figures, 16 tables)

Figures (9)

  • Figure 1: Average number of experts per layer after pruning 75% of experts with the global threshold algorithm (average activity threshold: 0.69). Pruning is done per language direction and the values are averaged over the 870 directions of the valid set.
  • Figure 2: chrF++ and spBLEU valid scores on 30 languages for different resource types as a function of the percentage of experts retained. Pruning is done per language pair with the importance metric and with a fixed number of experts per layer.
  • Figure 3: Hierarchical clustering of languages based on the importance metric of experts in the decoder. Different colors represent different language subgroupings.
  • Figure 4: spBLEU valid scores on 30 languages for different resource types as a function of the percentage of experts retained. Pruning is done at the language pair granularity with the importance metric and with a fixed number of experts per layer.
  • Figure 5: Jaccard similarity of selected 25% decoder experts for different languages. Pruning was done per language with the importance metric and enc/dec threshold pruning. Languages are sorted by language family.
  • ...and 4 more figures
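
Several captions above refer to keeping a fixed number of experts per layer and to the Jaccard similarity between the expert sets selected for different languages. The following is a minimal sketch of both ideas, not the authors' implementation. The per-expert scores are made-up toy values standing in for the importance metric, which this section does not define, and the function names are hypothetical. Figure 5 uses enc/dec threshold pruning, whereas this sketch uses the simpler fixed-per-layer selection.

```python
def select_top_experts(scores, keep_ratio):
    """Keep the highest-scoring experts in each layer.

    scores: one list of per-expert scores per MoE layer.
    Returns one set of kept expert ids per layer.
    """
    kept = []
    for layer_scores in scores:
        k = max(1, round(len(layer_scores) * keep_ratio))
        ranked = sorted(
            range(len(layer_scores)),
            key=lambda e: layer_scores[e],
            reverse=True,
        )
        kept.append(set(ranked[:k]))
    return kept


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_similarity(kept_a, kept_b):
    """Jaccard similarity of two selections, treating (layer, expert) as one item."""
    a = {(layer, e) for layer, ids in enumerate(kept_a) for e in ids}
    b = {(layer, e) for layer, ids in enumerate(kept_b) for e in ids}
    return jaccard(a, b)


# Toy scores: 2 layers with 8 experts each, for two hypothetical languages.
lang_x = [
    [0.9, 0.1, 0.8, 0.0, 0.2, 0.1, 0.0, 0.3],
    [0.0, 0.7, 0.1, 0.6, 0.0, 0.2, 0.1, 0.0],
]
lang_y = [
    [0.8, 0.0, 0.1, 0.0, 0.9, 0.1, 0.0, 0.2],
    [0.1, 0.9, 0.0, 0.5, 0.0, 0.1, 0.2, 0.0],
]

kept_x = select_top_experts(lang_x, keep_ratio=0.25)
kept_y = select_top_experts(lang_y, keep_ratio=0.25)
sim = selection_similarity(kept_x, kept_y)
print(kept_x, kept_y, sim)
```

Pairing each expert id with its layer index keeps expert 0 in one layer from being counted as the same item as expert 0 in another. Computing this similarity for every pair of languages would give a matrix of the kind that Figure 5 visualizes.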