Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation

Songming Zhang; Yunlong Liang; Shuaibo Wang; Wenjuan Han; Jian Liu; Jinan Xu; Yufeng Chen

TL;DR

This work investigates where knowledge resides in knowledge distillation for neural machine translation and uncovers that the teacher's top-1 information predominantly drives KD gains. It establishes a link between word- and sequence-level KD by showing a shared emphasis on top-1 predictions and introduces Top-1 Information Enhanced KD (TIE-KD), which combines a hierarchical ranking loss with an iterative KD procedure to leverage data without ground-truth targets. Empirical results across three WMT benchmarks show that TIE-KD consistently improves Transformer base students over vanilla Word-KD and can match or exceed sequence-level KD in certain settings, with strong generalization across teacher-student capacity gaps. The findings offer a principled perspective on KD mechanisms in NMT and deliver practical gains for model compression and deployment.

Abstract

Knowledge distillation (KD) is a promising technique for model compression in neural machine translation. However, it is still unclear where the knowledge in KD resides, which may hinder further development of KD. In this work, we first unravel this mystery from an empirical perspective and show that the knowledge comes from the top-1 predictions of teachers, which also helps us build a potential connection between word- and sequence-level KD. Based on this finding, we further point out two inherent issues in vanilla word-level KD. First, the current KD objective spreads its focus over whole distributions and gives no special treatment to the most crucial top-1 information. Second, the knowledge is largely covered by the golden information, because most top-1 predictions of teachers overlap with ground-truth tokens, which further restricts the potential of KD. To address these issues, we propose a novel method named Top-1 Information Enhanced Knowledge Distillation (TIE-KD). Specifically, we design a hierarchical ranking loss to enforce the learning of the top-1 information from the teacher. Additionally, we develop an iterative KD procedure to infuse additional knowledge by distilling on data without ground-truth targets. Experiments on WMT'14 English-German, WMT'14 English-French and WMT'16 English-Romanian demonstrate that our method boosts Transformer$_{base}$ students by +1.04, +0.60 and +1.11 BLEU, respectively, and significantly outperforms the vanilla word-level KD baseline. Moreover, our method generalizes better across different teacher-student capacity gaps than existing KD techniques.
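
To make the quantities in the abstract concrete, the following is a minimal sketch for a single target position, not the paper's implementation. `word_kd_loss` is the standard KL-based word-level KD term. `top1_ranking_penalty` is a hypothetical hinge-style stand-in for "enforcing the teacher's top-1" and should not be read as the hierarchical ranking loss defined in the paper. `top1_overlap_rate` measures how often the teacher's top-1 token coincides with the ground-truth token. All function names, the `margin` parameter, and the toy logits are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_kd_loss(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over the vocabulary at one target position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def top1_ranking_penalty(teacher_probs, student_probs, margin=0.0):
    """Hypothetical hinge penalty (an assumption, not the paper's loss):
    penalize every token the student scores within `margin` of, or above,
    the teacher's top-1 token."""
    top1 = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    return sum(
        max(0.0, margin + student_probs[j] - student_probs[top1])
        for j in range(len(student_probs))
        if j != top1
    )


def top1_overlap_rate(teacher_top1_ids, gold_ids):
    """Fraction of positions where the teacher's top-1 equals the gold token."""
    matches = sum(t == g for t, g in zip(teacher_top1_ids, gold_ids))
    return matches / len(gold_ids)


# Toy vocabulary of three tokens; the logits are made up for illustration.
teacher = softmax([2.0, 1.0, 0.1])
student_agree = softmax([1.5, 1.0, 0.5])     # same top-1 as the teacher
student_disagree = softmax([1.0, 1.5, 0.5])  # different top-1

print(word_kd_loss(teacher, student_agree))
print(top1_ranking_penalty(teacher, student_agree))
print(top1_ranking_penalty(teacher, student_disagree))
print(top1_overlap_rate([1, 2, 3, 4], [1, 2, 0, 4]))
```

In this sketch, a student can have a nonzero KL term while already agreeing with the teacher's top-1 token, which is the distinction the abstract draws between fitting the whole distribution and learning the top-1 information.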

Paper Structure (47 sections, 18 equations, 5 figures, 10 tables, 1 algorithm)

Introduction
Background
Neural Machine Translation
Word-level Knowledge Distillation
Sequence-level Knowledge Distillation
Probing the Knowledge of KD in NMT
Which Information Determines the Performance of Word-level KD?
Can Student Models Really Learn the Correlation Information?
Top-$\bm{k}$ Edit Distance.
Top-$\bm{k}$ Ranking Distance.
Does Knowledge Increase with Top-$\bm{k}$ Information?
Does Top-1 Information Work in All Soft Targets?
Expanding to Sequence-level KD
Rethinking KD in NMT from the Perspective of the Top-1 Information
Top-1 Information Enhanced Knowledge Distillation for NMT
...and 32 more sections

Figures (5)

Figure 1: Removing different information from the original soft targets provided by the teacher during word-level KD. Note that the soft target in "w/o KD" is equivalent to the soft target of label smoothing.
Figure 2: BLEU scores (%) of KD with different information in three intervals of soft targets on the validation set of the WMT'14 En-De task.
Figure 3: Performance of KD techniques with different teacher models on the test set of the WMT'14 En-De task.
Figure 4: BLEU scores of our method with different $k$ on the validation set of the WMT'14 En-De task.
Figure 5: BLEU scores of our method with different iteration times $N$ on the validation set of the WMT'14 En-De task and the corresponding training costs.