Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
Sukrit Sriratanawilai, Jhayahgrit Thongwat, Romrawin Chumpu, Patomporn Payoungkhamdee, Sarana Nutanong, Peerat Limkonchotiwat
TL;DR
This work systematically analyzes knowledge distillation (KD) strategies for multilingual vision-language models under compression. It introduces five KD losses (FD, ED, SD, MCL, and DR) and a multi-objective framework to study their effects on cross-lingual alignment and downstream performance. The results show that Distributional Replication (DR) generally yields the best multilingual retrieval, while English-Control Distillation (ED) performs best on visual question answering. Combining objectives improves some tasks but can harm others, highlighting task-sensitive trade-offs. Crucially, a multilingual student can approach or match the teacher on average while offering strong multilingual consistency in clustering and ranking, suggesting practical paths for deploying smaller multilingual VLMs. The findings imply that careful KD design enables efficient, language-diverse VLM deployment without substantial sacrifices in cross-lingual retrieval or robustness.
Abstract
Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingual settings remains underexplored. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We apply these five formulations to CLIP and SigLIP2 and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, while others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
