KD4MT: A Survey of Knowledge Distillation for Machine Translation

Ona de Gibert; Joseph Attieh; Timothee Mickus; Yves Scherrer; Jörg Tiedemann

TL;DR

KD4MT surveys how knowledge distillation (KD) reshapes supervision in machine translation (MT) beyond mere compression, examining Word-KD, Seq-KD, and feature-based KD along with extensions such as multi-teacher, proxy-task, and LLM-based approaches. It covers applications across multilingual MT, low-resource MT, domain adaptation, and time-sensitive settings, and highlights practical guidelines, risks (hallucination, memorization, bias), and evaluation gaps. The survey provides a public database and glossary to support reproducibility and emphasizes the evolving role of LLMs as data sources and supervision signals. It argues that KD in MT is a mechanism for regularization and knowledge transfer that can improve generalization and efficiency, while calling for careful experimental design, broader language coverage, and robust evaluation practices.

Abstract

Knowledge Distillation (KD) as a research area has gained a lot of traction in recent years as a compression tool to address challenges related to ever-larger models in NLP. Remarkably, Machine Translation (MT) offers a much more nuanced take on this narrative: in MT, KD also functions as a general-purpose knowledge transfer mechanism that shapes supervision and translation quality as well as efficiency. This survey synthesizes KD for MT (KD4MT) across 105 papers (through October 1, 2025). We begin by introducing both MT and KD for non-experts, followed by an overview of the standard KD approaches relevant to MT applications. Subsequently, we categorize advances in the KD4MT literature based on (i) their methodological contributions and (ii) their practical applications. Our qualitative and quantitative analyses identify common trends in the field and highlight key research gaps as well as the absence of unified evaluation practice for KD methods in MT. We further provide practical guidelines for selecting a KD method in concrete settings and highlight potential risks associated with the application of KD to MT such as increased hallucination and bias amplification. Finally, we discuss the role of LLMs in re-shaping the KD4MT field. To support further research, we complement our survey with a publicly available database summarizing the main characteristics of the surveyed KD methods and a glossary of key terms.

Paper Structure (35 sections, 8 equations, 6 figures, 2 tables)

Introduction
Preliminaries
The Task of Machine Translation
Non-autoregressive translation (NAT)
LLM-based translation
Foundations of KD for MT
Response-based KD
Word-Level KD (Word-KD)
Sequence-Level KD (Seq-KD)
Feature-based KD
Algorithms of KD for MT
Selecting Supervision
At the token level (Word-KD methods)
At the sentence level (Seq-KD methods)
At the representation level (feature-based methods)
...and 20 more sections

Figures (6)

Figure 1: Teacher versus student parameter counts. Dot size indicates frequency of a specific configuration, color marks the compression ratio $\frac{\mathrm{Teacher\ size}}{\mathrm{Student\ size}}$. Roughly half of the works use a ratio of 1. Best viewed in color.
Figure 2: Overview of the main KD methods for MT (Kim and Rush, 2016). The blue rectangles symbolize input embedding vectors, the red rectangles intermediate representations of (typically) transformer layers, and the yellow rectangles the probability distributions over output tokens. The dashed lines represent the supervision signal that is provided by the different KD approaches: the teacher provides the student with output distributions (word-level KD), decoded sequences (sequence-level KD), or intermediate representations (feature-based KD). Best viewed in color.
Figure 3: Breakdown of works surveyed per application and KD algorithm used. We surveyed 105 papers. However, some studies proposed methods utilizing more than one fundamental KD method, resulting in 115 unique approaches overall.
Figure 4: Cumulative number of papers using each KD type (Word-KD, Seq-KD, Feature-based KD) from 2016–2025.
Figure 5: Frequency of dataset and metric usage in the surveyed papers. Only datasets and metrics that appear more than twice are shown.
...and 1 more figure
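
The three supervision signals named in the Figure 2 caption (output distributions, decoded sequences, intermediate representations) can be illustrated with a minimal sketch. This is not the survey's implementation: the function names, the toy logits, and the use of greedy per-position argmax in place of a real teacher decoder (such as beam search) are assumptions made purely for illustration, and the feature-based loss assumes teacher and student representations share the same dimensionality.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_kd_loss(teacher_logits, student_logits):
    """Word-level KD (illustrative): cross-entropy of the student against the
    teacher's per-token output distribution, averaged over positions."""
    loss = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        q = softmax(t_row)  # teacher distribution (soft target)
        p = softmax(s_row)  # student distribution
        loss -= sum(qi * math.log(pi) for qi, pi in zip(q, p))
    return loss / len(teacher_logits)


def seq_kd_loss(teacher_tokens, student_logits):
    """Sequence-level KD (illustrative): negative log-likelihood of the
    teacher's decoded token sequence under the student."""
    loss = 0.0
    for tok, s_row in zip(teacher_tokens, student_logits):
        loss -= math.log(softmax(s_row)[tok])
    return loss / len(teacher_tokens)


def feature_kd_loss(teacher_hidden, student_hidden):
    """Feature-based KD (illustrative): mean squared error between
    intermediate representations of equal dimensionality (an assumption)."""
    total, count = 0.0, 0
    for t_vec, s_vec in zip(teacher_hidden, student_hidden):
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    return total / count


# Toy example: 2 target positions, vocabulary of 3 tokens (made-up numbers).
teacher_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_logits = [[1.0, 0.8, -0.5], [0.0, 0.9, 0.6]]

# Stand-in for teacher decoding: greedy argmax per position (a simplification).
teacher_tokens = [max(range(len(row)), key=row.__getitem__) for row in teacher_logits]

print("Word-KD loss:", word_kd_loss(teacher_logits, student_logits))
print("Seq-KD loss:", seq_kd_loss(teacher_tokens, student_logits))
print("Feature-KD loss:", feature_kd_loss([[1.0, 2.0]], [[0.0, 0.0]]))
```

In this sketch the only difference between the word-level and sequence-level losses is the target: a full teacher distribution per position versus a single decoded token per position. In practice such terms are often combined with the standard cross-entropy on reference translations; the weighting is a design choice not specified here.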
