Don't Throw Away Data: Better Sequence Knowledge Distillation
Jun Wang, Eleftheria Briakou, Hamid Dadkhahi, Rishabh Agarwal, Colin Cherry, Trevor Cohn
TL;DR
This work introduces MBR-$n$, a knowledge distillation strategy that uses multiple high-scoring MBR translations as supervision rather than a single best output. By training on the top-$n$ MBR candidates, the student better approximates the teacher’s distribution, yielding consistent gains on English–German and English–Japanese translation with PaLM2 models. The approach improves data efficiency, supports curriculum-based staged training, and remains robust in both in-domain and out-of-domain evaluations. The paper also analyzes the effects of model capacity and output diversity. Overall, MBR-$n$ provides a practical, scalable enhancement to sequence-level KD for translation with large language models.
Abstract
A critical component in knowledge distillation is the means of coupling the teacher and student. The predominant sequence knowledge distillation method involves supervised learning of the student against teacher-decoded outputs, and is exemplified by the current state of the art, which incorporates minimum Bayes risk (MBR) decoding. In this paper we seek to integrate MBR more tightly into distillation training, specifically by using several high-scoring MBR translations, rather than a single selected sequence, thus capturing a rich diversity of teacher outputs. Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods for both tasks and with varying model sizes. Additionally, we conduct a detailed analysis of data efficiency and the capacity curse to elucidate the method, which we call MBR-n, and to explore its further potential.
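The core idea of keeping several high-scoring MBR candidates, rather than one, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `unigram_f1` is a toy stand-in for whatever utility metric is actually used, and `mbr_top_n` and `build_distillation_pairs` are hypothetical helpers, not the authors' implementation. Each candidate is scored by its average utility against the other teacher samples, and the top n become student training targets.

```python
from collections import Counter
from typing import Callable, List, Tuple


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility: unigram F1 overlap (an assumption, not the paper's metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_top_n(
    candidates: List[str],
    n: int,
    utility: Callable[[str, str], float] = unigram_f1,
) -> List[Tuple[str, float]]:
    """Rank candidates by expected utility against the other candidates
    (used as pseudo-references) and return the n best with their scores."""
    scored = []
    for i, hyp in enumerate(candidates):
        others = [ref for j, ref in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        scored.append((hyp, score))
    # sorted() is stable, so ties keep their original sampling order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


def build_distillation_pairs(
    source: str, candidates: List[str], n: int
) -> List[Tuple[str, str]]:
    """Pair one source sentence with each of its top-n MBR candidates."""
    return [(source, hyp) for hyp, _ in mbr_top_n(candidates, n)]


if __name__ == "__main__":
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a cat sat on the mat",
        "dogs bark loudly",
    ]
    for src, tgt in build_distillation_pairs("die Katze ...", samples, n=2):
        print(src, "->", tgt)
```

With n = 1 this reduces to ordinary single-output MBR distillation; larger n exposes the student to more of the teacher's high-utility outputs at the cost of more training targets per source sentence.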
