
Don't Throw Away Data: Better Sequence Knowledge Distillation

Jun Wang, Eleftheria Briakou, Hamid Dadkhahi, Rishabh Agarwal, Colin Cherry, Trevor Cohn

TL;DR

This work introduces MBR-$n$, a knowledge distillation strategy that uses multiple high-scoring MBR translations as supervision rather than a single best output. By training on the top-$n$ MBR candidates, the student better approximates the teacher’s distribution, yielding consistent gains on English–German and English–Japanese translation with PaLM2 models. The approach improves data efficiency, supports curriculum-based staged training, and remains robust across in-domain and out-of-domain evaluations; the paper also analyzes the effects of model capacity and output diversity. Overall, MBR-$n$ provides a practical, scalable enhancement to sequence-level KD for translation with large language models.

Abstract

A critical component in knowledge distillation is the means of coupling the teacher and student. The predominant sequence knowledge distillation method involves supervised learning of the student against teacher-decoded outputs, and is exemplified by the current state of the art, which incorporates minimum Bayes risk (MBR) decoding. In this paper we seek to integrate MBR more tightly in distillation training, specifically by using several high-scoring MBR translations, rather than a single selected sequence, thus capturing a rich diversity of teacher outputs. Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods for both tasks and with varying model sizes. Additionally, we conduct a detailed analysis focusing on data efficiency and the capacity curse to elucidate MBR-$n$ and explore its further potential.
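
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `unigram_f1` and `mbr_top_n` are hypothetical, the token-overlap utility is only a toy stand-in for a learned metric such as BLEURT, and using the other candidates as pseudo-references is the usual sampling-based MBR setup, assumed here rather than taken from the text above.

```python
from collections import Counter


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility: token-overlap F1 (assumed stand-in for a learned metric)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_top_n(candidates, n, utility=unigram_f1):
    """Rank teacher candidates by expected utility and return the best n.

    Each candidate is scored by its average utility against all other
    candidates, which act as pseudo-references. With n == 1 this reduces
    to ordinary MBR selection of a single sequence.
    """
    scored = []
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        scored.append((score, hyp))
    # Python's sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hyp for _, hyp in scored[:n]]


# Hypothetical teacher samples for one source sentence.
samples = ["the cat sat", "the cat sat down", "a dog ran", "the cat sits"]
kd_targets = mbr_top_n(samples, n=2)
```

Under this sketch, each source sentence would be paired with every string in `kd_targets` to form the student's training examples, instead of keeping only the single top-ranked candidate.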

Paper Structure (30 sections, 3 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Comparing a pre-trained student (a) versus one fine-tuned for translation (b). Here an XXXS student is trained against an XXS teacher on en-de. Reported in the caption are the BLEURT scores for the student models before KD training; the accuracy of the teacher is 0.7552, as reported in Table \ref{tab:teacher-scores}. The yellow line shows the effect of training on $5\ldots40$ random samples.
  • Figure 2: Staged training for en-de translation, where the student is trained in a two-stage curriculum against different teachers.
  • Figure 3: MBR-$n$ is more data efficient than baseline methods, in terms of the volume of distillation training data required. Shown above are results for English to German translation with two student/teacher configurations. KD instances are measured in thousands of sentences, with the rightmost 30k setting corresponding to the complete KD dataset.
  • Figure 4: Diversity of outputs measured using self-BLEU. Two settings are illustrated: student PaLM2-XXXS trained with teacher PaLM2-XXS, and student PaLM2-XXS trained with teacher PaLM2-XS, on the English to German translation task.
  • Figure 5: Candidate selection by temperature sampling, de-en translation.
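
The self-BLEU diversity measure named in the Figure 4 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's evaluation code: `simple_bleu` and `self_bleu` are hypothetical helpers, tokenization is plain whitespace splitting, and only unigrams and bigrams are used with no smoothing. A lower self-BLEU indicates more diverse outputs.

```python
import math
from collections import Counter


def ngrams(tokens, k):
    """Count the k-grams in a token list."""
    return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))


def simple_bleu(hyp, refs, max_n=2):
    """Simplified sentence BLEU: clipped n-gram precision, brevity penalty."""
    h = hyp.split()
    ref_tokens = [r.split() for r in refs]
    if not h or not ref_tokens:
        return 0.0
    log_precision = 0.0
    for k in range(1, max_n + 1):
        hyp_counts = ngrams(h, k)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0
        # Clip each hypothesis n-gram by its maximum count in any reference.
        max_ref = Counter()
        for r in ref_tokens:
            for gram, count in ngrams(r, k).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty against the reference closest in length.
    closest = min((abs(len(r) - len(h)), len(r)) for r in ref_tokens)[1]
    bp = 1.0 if len(h) >= closest else math.exp(1 - closest / len(h))
    return bp * math.exp(log_precision)


def self_bleu(outputs, max_n=2):
    """Average BLEU of each output against all the other outputs."""
    if len(outputs) < 2:
        return 0.0
    scores = [
        simple_bleu(out, outputs[:i] + outputs[i + 1:], max_n)
        for i, out in enumerate(outputs)
    ]
    return sum(scores) / len(scores)
```

A set of identical outputs scores at the maximum under this sketch, while outputs sharing no tokens score zero, which matches the intended reading of the measure as an inverse indicator of diversity.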