KD4MT: A Survey of Knowledge Distillation for Machine Translation
Ona de Gibert, Joseph Attieh, Timothee Mickus, Yves Scherrer, Jörg Tiedemann
TL;DR
KD4MT surveys how knowledge distillation reshapes supervision in machine translation beyond mere compression, examining Word-KD, Seq-KD, and feature-based KD, as well as extensions such as multi-teacher, proxy-task, and LLM-based approaches. It covers applications across multilingual MT, low-resource MT, domain adaptation, and time-sensitive settings, and highlights practical guidelines, risks (hallucination, memorization, bias), and evaluation gaps. The survey provides a public database and glossary to support reproducibility and emphasizes the evolving role of LLMs as data sources and supervision signals. It argues that KD in MT is a mechanism for regularization and knowledge transfer that can improve generalization and efficiency, while calling for careful experimental design, broader language coverage, and robust evaluation practices.
Abstract
Knowledge Distillation (KD) has gained considerable traction in recent years as a compression tool for addressing the challenges posed by ever-larger models in NLP. Machine Translation (MT), however, offers a more nuanced take on this narrative: in MT, KD also functions as a general-purpose knowledge transfer mechanism that shapes supervision and translation quality as well as efficiency. This survey synthesizes KD for MT (KD4MT) across 105 papers (through October 1, 2025). We begin by introducing both MT and KD for non-experts, followed by an overview of the standard KD approaches relevant to MT applications. Subsequently, we categorize advances in the KD4MT literature based on (i) their methodological contributions and (ii) their practical applications. Our qualitative and quantitative analyses identify common trends in the field and highlight key research gaps as well as the absence of a unified evaluation practice for KD methods in MT. We further provide practical guidelines for selecting a KD method in concrete settings and highlight potential risks associated with the application of KD to MT, such as increased hallucination and bias amplification. Finally, we discuss the role of LLMs in reshaping the KD4MT field. To support further research, we complement our survey with a publicly available database summarizing the main characteristics of the surveyed KD methods and a glossary of key terms.
