Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model
Yeskendir Koishekenov, Alexandre Berard, Vassilina Nikoulina
TL;DR
This work targets memory-efficient deployment of NLLB-200, a 54.5B-parameter multilingual Mixture-of-Experts (MoE) machine translation model, by pruning experts at inference time without fine-tuning. It introduces expert-pruning metrics derived from gate statistics, compares pruning strategies at three granularities (global, language-pair, and language-specific), and shows that up to 80% of experts can be removed with negligible loss in translation quality. The authors find that language-specific and language-pair experts emerge in NLLB-200, and that decoder experts tend to cluster by linguistically related languages, which makes single-GPU decoding practical when the model is pruned appropriately. Language-aware pruning, particularly per-language pruning with a 3:1 encoder-to-decoder ratio, matches or closely approaches the full MoE model while delivering meaningful speedups and memory reductions, making massively multilingual MT more accessible for deployment. The authors also release pruning statistics and expert IDs to support reproducible, memory-efficient use of NLLB-200 on a single GPU.
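To make the idea of pruning from gate statistics concrete, here is a minimal sketch. It uses selection frequency (how often the gate routes a token to each expert) as an illustrative stand-in; it is an assumption for this example and not necessarily one of the metrics the authors define. The function names, the toy routing data, and the `keep_ratio` parameter are all hypothetical.

```python
from collections import Counter


def expert_usage(top_k_assignments):
    """Count how often each expert ID is selected by the gate.

    top_k_assignments: iterable of tuples, one per token, holding the
    expert IDs chosen by the gate for that token (hypothetical format).
    """
    counts = Counter()
    for experts in top_k_assignments:
        counts.update(experts)
    return counts


def select_experts(counts, num_experts, keep_ratio):
    """Return the sorted IDs of the most frequently selected experts.

    keep_ratio is the fraction of experts to retain; ties are broken
    by the lower expert ID so the result is deterministic.
    """
    keep = max(1, round(num_experts * keep_ratio))
    ranked = sorted(range(num_experts), key=lambda e: (-counts.get(e, 0), e))
    return sorted(ranked[:keep])


# Toy routing decisions for five tokens with top-2 gating (made-up data).
assignments = [(0, 3), (3, 7), (3, 0), (7, 3), (0, 1)]
counts = expert_usage(assignments)

# Keeping 20% of 10 experts corresponds to pruning 80% of them.
kept = select_experts(counts, num_experts=10, keep_ratio=0.2)
```

Collecting `assignments` separately per language or per language pair, instead of over all data at once, would give the language-specific and language-pair granularities mentioned above under this same assumed metric.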
Abstract
The recently released NLLB-200 is a set of multilingual Neural Machine Translation models that cover 202 languages. The largest model is based on a Mixture of Experts architecture and achieves state-of-the-art results across many language pairs. It contains 54.5B parameters and requires at least four 32GB GPUs just for inference. In this work, we propose a pruning method that enables the removal of up to 80% of experts without further fine-tuning and with a negligible loss in translation quality, which makes it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics can identify language-specific experts.
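The four-GPU figure is consistent with a back-of-the-envelope estimate of weight memory alone. The sketch below assumes 16-bit weights (2 bytes per parameter) and decimal gigabytes, and ignores activations and other runtime overhead; these are assumptions for illustration, not details stated in the abstract. The helper name `min_gpus` is hypothetical.

```python
import math


def min_gpus(num_params, bytes_per_param, gpu_mem_gb):
    """Estimate weight memory in GB and the GPUs needed to hold it.

    Counts weights only; activations, caches, and framework overhead
    are ignored, so the real requirement can only be higher.
    """
    weights_gb = num_params * bytes_per_param / 1e9
    return weights_gb, math.ceil(weights_gb / gpu_mem_gb)


# 54.5B parameters at an assumed 2 bytes each, on 32GB GPUs.
weights_gb, gpus = min_gpus(54.5e9, 2, 32)
```

Under these assumptions the weights occupy about 109 GB, which exceeds three 32GB GPUs (96 GB) and therefore needs four.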
