Self-Evolution Knowledge Distillation for LLM-based Machine Translation

Yuncheng Song, Liang Ding, Changtong Zan, Shujian Huang

TL;DR

This work tackles the inefficiency of uniform knowledge distillation for LLM-based machine translation by introducing Self-Evolution KD, a token-aware, two-stage distillation framework. Stage 1 computes per-token learning difficulty as the KL divergence between the student output and a target distribution that blends ground-truth and teacher knowledge, which allows tokens to be dynamically categorized into hard and easy groups. Stage 2 adjusts the distillation signal by mixing in prior knowledge for hard tokens through a proxy distribution while leaving easy tokens unchanged, leading to faster convergence and stronger transfer. Empirical results on WMT22 with Llama-7B/13B show consistent SacreBLEU gains over baselines (about +1.4 points on average across four translation directions) and competitive performance relative to teacher models, demonstrating the practical impact of token-level prior knowledge in KD for MT.

Abstract

Knowledge distillation (KD) has shown great promise in transferring knowledge from larger teacher models to smaller student models. However, existing KD strategies for large language models often minimize the divergence between the output distributions of the student and teacher models indiscriminately for each token. This overlooks the imbalanced nature of tokens and their varying transfer difficulties. In response, we propose a distillation strategy called Self-Evolution KD. The core of this approach is to dynamically integrate the teacher distribution and the one-hot distribution of the ground truth into the student distribution as prior knowledge, which promotes the distillation process. It adjusts the ratio of prior knowledge based on token learning difficulty, fully leveraging the teacher model's potential. Experimental results show our method brings an average improvement of approximately 1.4 SacreBLEU points across four translation directions in the WMT22 test sets. Further analysis indicates that the improvement comes from better knowledge transfer from teachers, confirming our hypothesis.
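
To make the two stages concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the target distribution is a convex blend of the one-hot ground truth and the teacher distribution (the weight `alpha` is hypothetical), that a token is "hard" when its KL difficulty exceeds the threshold `gamma` (standing in for $\Gamma$), and that the proxy distribution for hard tokens mixes the target into the student distribution with weight `beta`. The exact formulas, the direction of the KL term, and the function names are illustrative assumptions.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """Row-wise KL(p || q) for arrays of shape (tokens, vocab)."""
    q = np.clip(q, eps, 1.0)
    safe_p = np.clip(p, eps, 1.0)
    # Entries where p == 0 contribute nothing to the sum.
    terms = np.where(p > 0, p * (np.log(safe_p) - np.log(q)), 0.0)
    return terms.sum(axis=-1)


def self_evolution_sketch(student, teacher, gold_ids, gamma, beta, alpha=0.5):
    """Illustrative two-stage computation (all mixing rules are assumptions).

    student, teacher: (tokens, vocab) probability distributions.
    gold_ids: ground-truth token index per position.
    gamma: difficulty threshold; beta: prior-knowledge mixing weight.
    alpha: hypothetical weight of the one-hot label in the target blend.
    """
    vocab = student.shape[-1]
    one_hot = np.eye(vocab)[np.asarray(gold_ids)]

    # Assumed target: blend of ground truth and teacher knowledge.
    target = alpha * one_hot + (1.0 - alpha) * teacher

    # Stage 1 (self-question): per-token difficulty and hard/easy split.
    difficulty = kl_divergence(target, student)
    hard = difficulty > gamma

    # Stage 2 (self-evolution): mix prior knowledge into hard tokens only;
    # easy tokens keep the plain student distribution.
    mixed = beta * target + (1.0 - beta) * student
    proxy = np.where(hard[:, None], mixed, student)

    # Per-token distillation loss against the assumed target.
    loss = kl_divergence(target, proxy)
    return difficulty, hard, proxy, loss
```

Under these assumptions, the loss for a hard token is never larger than its Stage 1 difficulty, because the proxy already contains a share of the target; easy tokens are distilled exactly as in ordinary KD.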

Paper Structure

This paper contains 28 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overall framework of our Self-Evolution KD. It contains two main stages: ① self-question: computing the learning difficulty as the KL divergence between the student distribution and the target distribution, and dividing tokens into different categories ("comparison" means comparing the learning difficulty with the preset threshold $\Gamma$); ② self-evolution: building a proxy distribution for each token category by smoothing the target and student distributions. "Updated" marks parameters that are updated during training, while "Frozen" marks parameters that are not.
  • Figure 2: Effect of the threshold $\Gamma$ and the percentage $K$ used to select hard-to-learn tokens, and effect of $\beta$, which determines the mixture proportion of prior knowledge. For the $\Gamma$ and $\beta$ panels, we report average SacreBLEU points on the above-mentioned validation dataset. For the $K$ panel, we report average SacreBLEU points on the WMT22 test sets, since we compare different distillation strategies.
  • Figure 3: Comparison of the static and progressive strategies for the factor $\beta$. "0.5$\to$0.0" means that $\beta_b$ is 0.5 and $\beta_e$ is 0.0. "Static strategy (0.5)" denotes the results of Self-Evolution KD with $\beta$ = 0.5.
  • Figure 4: Comparison of teacher’s knowledge transfer across different distillation strategies.
  • Figure 5: Effect of the loss weight of the SKEW KD (Teacher). We only report the Self-Evolution KD for reference.
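
The progressive strategy in Figure 3 moves $\beta$ from $\beta_b$ to $\beta_e$ over training. The captions do not give the schedule's functional form, so the sketch below assumes that $\beta_b$ and $\beta_e$ are the values at the beginning and end of training and that the interpolation is linear in the training step. The function name `progressive_beta` and its parameters are hypothetical.

```python
def progressive_beta(step, total_steps, beta_b=0.5, beta_e=0.0):
    """Linearly interpolate beta from beta_b (first step) to beta_e (last step).

    The linear form is an assumption made for illustration; the source only
    names the start and end values (e.g. "0.5 -> 0.0").
    """
    if total_steps <= 1:
        return beta_e
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return beta_b + (beta_e - beta_b) * frac
```

A static strategy corresponds to setting `beta_b == beta_e`, in which case the function returns the same value at every step.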