MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods
Mara Finkelstein, Subhajit Naskar, Mehdi Mirzazadeh, Apurva Shah, Markus Freitag
TL;DR
This work tackles the mismatch between model probability and translation quality in neural machine translation by distilling the benefits of expensive decoding methods into training-time finetuning. It introduces MBR finetuning and QE finetuning, which generate distillation data from either the model itself or an external teacher model using MBR decoding or QE reranking, and then train a student model that performs well with efficient inference. Empirical results on English–German and English–Japanese show consistent gains over strong baselines, with especially large improvements when using PaLM-2 Bison as a teacher, sometimes beating finetuning on human references. The findings demonstrate that monolingual data can be leveraged via distillation to achieve high translation quality while maintaining fast decoding, and they open avenues for extending training-time quality gains to other NLG tasks.
Abstract
Recent research in decoding methods for Natural Language Generation (NLG) tasks has shown that maximum a posteriori (MAP) decoding is not optimal, because model probabilities do not always align with human preferences. Stronger decoding methods, including Quality Estimation (QE) reranking and Minimum Bayes' Risk (MBR) decoding, have since been proposed to mitigate the model-perplexity-vs-quality mismatch. While these decoding methods achieve state-of-the-art performance, they are prohibitively expensive to compute. In this work, we propose MBR finetuning and QE finetuning, which distill the quality gains from these decoding methods at training time, while using an efficient decoding algorithm at inference time. Using the canonical NLG task of Neural Machine Translation (NMT), we show that even with self-training, these finetuning methods significantly outperform the base model. Moreover, when using an external LLM as a teacher model, these finetuning methods outperform finetuning on human-generated references. These findings suggest new ways to leverage monolingual data to achieve improvements in model quality that are on par with, or even exceed, improvements from human-curated data, while maintaining maximum efficiency during decoding.
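
The data-generation step described above can be illustrated with a minimal sketch. It is not the authors' implementation. All function names (`overlap_f1`, `mbr_select`, `qe_select`, `build_finetuning_pairs`) are hypothetical, and the unigram-overlap utility is only a toy stand-in for the learned quality metrics such methods would normally use. The MBR variant shown scores each candidate against the other candidates as pseudo-references, which is one common formulation and an assumption here.

```python
from collections import Counter
from typing import Callable, List, Tuple


def overlap_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: unigram F1 between two strings (stand-in for a learned metric)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    common = sum((Counter(hyp) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(hyp), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: List[str],
    utility: Callable[[str, str], float] = overlap_f1,
) -> str:
    """MBR decoding: pick the candidate with the highest average utility
    against the other candidates, which act as pseudo-references."""

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


def qe_select(
    source: str,
    candidates: List[str],
    qe_score: Callable[[str, str], float],
) -> str:
    """QE reranking: pick the candidate that a reference-free scorer rates highest."""
    return max(candidates, key=lambda c: qe_score(source, c))


def build_finetuning_pairs(
    sources: List[str],
    sample_candidates: Callable[[str], List[str]],
    select: Callable[[str, List[str]], str],
) -> List[Tuple[str, str]]:
    """Turn monolingual sources into (source, selected translation) training pairs.
    `sample_candidates` would wrap the teacher (the model itself or an external one)."""
    return [(src, select(src, sample_candidates(src))) for src in sources]
```

Note the cost difference the abstract alludes to: MBR selection needs a number of utility calls that grows quadratically with the candidate count, while QE reranking needs one scorer call per candidate. In both cases the expense is paid once when building the finetuning data, and the finetuned student can then be decoded with a cheap algorithm.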
