AfroXLMR-Comet: Multilingual Knowledge Distillation with Attention Matching for Low-Resource languages
Joshua Sakthivel Raju, Sanjay S, Jaskaran Singh Walia, Srinivas Raghav, Vukosi Marivate
TL;DR
The paper addresses the challenge of deploying efficient multilingual transformers for low-resource African languages by introducing AfroXLMR-Comet, a compact student model trained with a hybrid distillation framework that combines standard knowledge distillation with a simplified attention matching mechanism. The teacher is AfroXLMR-Large and the student is the 68.9M-parameter AfroXLMR-Comet, an approximately 88% reduction in parameters, while performance on AfriSenti-SemEval remains competitive across five languages. The method uses mean-pooled attention matching, a two-stage task-agnostic training regime with soft targets, and a projection layer to align attentions, and it yields practical gains in inference speed and memory footprint. The work advances practical multilingual NLP for low-resource languages by producing efficient, deployment-ready models evaluated on standardized African-language sentiment tasks.
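As a quick sanity check on the compression figures quoted above, the following sketch back-computes the teacher size implied by a 68.9M-parameter student and an approximately 88% reduction. The teacher parameter count is not stated in this summary, so the value below is only what those two figures imply, not a reported number; the function and variable names are hypothetical.

```python
def reduction_fraction(teacher_params: float, student_params: float) -> float:
    """Fraction of parameters removed when going from teacher to student."""
    return 1.0 - student_params / teacher_params


# Figures taken from the summary above.
student_params = 68.9e6   # AfroXLMR-Comet
stated_reduction = 0.88   # "approximately 88%"

# Teacher size implied by those two figures (an inference, not a reported value).
implied_teacher_params = student_params / (1.0 - stated_reduction)

print(f"implied teacher size: {implied_teacher_params / 1e6:.1f}M parameters")
print(f"round-trip reduction: {reduction_fraction(implied_teacher_params, student_params):.2%}")
```

The implied teacher size is a little over half a billion parameters, which also agrees with the abstract's looser statement of a reduction of over 85%.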
Abstract
Language model compression through knowledge distillation has emerged as a promising approach for deploying large language models in resource-constrained environments. However, existing methods often struggle to maintain performance when distilling multilingual models, especially for low-resource languages. In this paper, we present a novel hybrid distillation approach that combines traditional knowledge distillation with a simplified attention matching mechanism designed specifically for multilingual contexts. Our method introduces an extremely compact student model architecture that is significantly smaller than conventional multilingual models. We evaluate our approach on five African languages: Kinyarwanda, Swahili, Hausa, Igbo, and Yoruba. The distilled student model, AfroXLMR-Comet, captures both the output distribution and the internal attention patterns of a larger teacher model (AfroXLMR-Large) while reducing the model size by over 85%. Experimental results demonstrate that our hybrid approach achieves competitive performance compared to the teacher model, maintaining accuracy within 85% of the original model's performance while requiring substantially fewer computational resources. Our work provides a practical framework for deploying efficient multilingual models in resource-constrained environments, particularly benefiting applications involving African languages.
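To make the hybrid objective concrete, here is a minimal, dependency-free sketch of one way such a loss could be assembled: a temperature-softened KL term on the output distributions plus a mean-squared error between head-averaged attention maps. This is an illustration under stated assumptions, not the authors' implementation. The weighting `alpha`, the temperature value, the use of MSE for the attention term, and all function names are hypothetical. The projection layer mentioned in the TL;DR is omitted here, because mean-pooling over heads already yields maps of the same sequence-by-sequence shape regardless of head count.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def mean_pool_heads(attn):
    """Average a list of per-head seq x seq attention matrices into one seq x seq map."""
    n_heads, seq = len(attn), len(attn[0])
    return [[sum(head[i][j] for head in attn) / n_heads for j in range(seq)]
            for i in range(seq)]


def attention_loss(student_attn, teacher_attn):
    """MSE between head-averaged maps; head counts may differ between the models."""
    s, t = mean_pool_heads(student_attn), mean_pool_heads(teacher_attn)
    seq = len(s)
    return sum((s[i][j] - t[i][j]) ** 2
               for i in range(seq) for j in range(seq)) / (seq * seq)


def hybrid_loss(student_logits, teacher_logits, student_attn, teacher_attn,
                temperature=2.0, alpha=0.5):
    """Weighted sum of the two terms; alpha is an assumed, untuned weight."""
    return (alpha * kd_loss(student_logits, teacher_logits, temperature)
            + (1.0 - alpha) * attention_loss(student_attn, teacher_attn))


# Toy example: a 2-head teacher and a 1-head student over a 2-token sequence.
teacher_attn = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
student_attn = [[[0.5, 0.5], [0.5, 0.5]]]
loss = hybrid_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], student_attn, teacher_attn)
```

In the toy example the single student head equals the average of the two teacher heads, so the attention term vanishes and the remaining loss comes entirely from the mismatch in output distributions.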
