MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods

Mara Finkelstein, Subhajit Naskar, Mehdi Mirzazadeh, Apurva Shah, Markus Freitag

TL;DR

This work tackles the mismatch between model probability and translation quality in neural machine translation by distilling the benefits of expensive decoding methods into the model at training time. It introduces MBR finetuning and QE finetuning: distillation data is generated with MBR decoding or QE reranking, using either the model itself (self-training) or an external teacher model, and a student model is then finetuned on that data so that it performs well with an efficient decoding algorithm at inference time. Empirical results on English–German and English–Japanese show consistent gains over strong baselines, with especially large improvements when PaLM-2 Bison is the teacher, in some cases exceeding finetuning on human references. The findings suggest that monolingual data can be leveraged via distillation to achieve high translation quality while keeping decoding fast, and they open avenues for extending training-time quality gains to other NLG tasks.

Abstract

Recent research in decoding methods for Natural Language Generation (NLG) tasks has shown that MAP decoding is not optimal, because model probabilities do not always align with human preferences. Stronger decoding methods, including Quality Estimation (QE) reranking and Minimum Bayes' Risk (MBR) decoding, have since been proposed to mitigate the model-perplexity-vs-quality mismatch. While these decoding methods achieve state-of-the-art performance, they are prohibitively expensive to compute. In this work, we propose MBR finetuning and QE finetuning, which distill the quality gains from these decoding methods at training time, while using an efficient decoding algorithm at inference time. Using the canonical NLG task of Neural Machine Translation (NMT), we show that even with self-training, these finetuning methods significantly outperform the base model. Moreover, when using an external LLM as a teacher model, these finetuning methods outperform finetuning on human-generated references. These findings suggest new ways to leverage monolingual data to achieve improvements in model quality that are on par with, or even exceed, improvements from human-curated data, while maintaining maximum efficiency during decoding.
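
The two selection rules named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `overlap_f1` and `toy_qe` are toy stand-ins invented here for the learned reference-based and reference-free metrics, and every function name (`mbr_decode`, `qe_rerank`, `build_finetuning_pairs`) is hypothetical. The sketch assumes MBR decoding picks the candidate with the highest average utility against the other candidates (used as pseudo-references), that QE reranking picks the candidate with the highest reference-free score, and that the selected outputs become finetuning targets.

```python
from collections import Counter
from typing import Callable, List, Tuple


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy utility (assumption): unigram-overlap F1, standing in for a learned metric."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def toy_qe(source: str, hyp: str) -> float:
    """Toy reference-free score (assumption): penalize length mismatch with the source."""
    return -abs(len(source.split()) - len(hyp.split()))


def mbr_scores(cands: List[str], utility: Callable[[str, str], float]) -> List[float]:
    """Average utility of each candidate against all candidates as pseudo-references."""
    return [sum(utility(h, r) for r in cands) / len(cands) for h in cands]


def mbr_decode(cands: List[str], utility: Callable[[str, str], float] = overlap_f1) -> str:
    """Return the candidate with the highest expected utility (quadratic in len(cands))."""
    scores = mbr_scores(cands, utility)
    best = max(range(len(cands)), key=scores.__getitem__)
    return cands[best]


def qe_rerank(source: str, cands: List[str], qe_metric: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest reference-free score (linear in len(cands))."""
    return max(cands, key=lambda h: qe_metric(source, h))


def build_finetuning_pairs(
    sources: List[str],
    sample_candidates: Callable[[str], List[str]],
    select: Callable[[str, List[str]], str],
) -> List[Tuple[str, str]]:
    """Pair each monolingual source with the output chosen by the expensive decoder."""
    return [(src, select(src, sample_candidates(src))) for src in sources]


if __name__ == "__main__":
    src = "the cat sat down"
    cands = ["die Katze saß", "die Katze saß da", "die Katze lief"]
    print("MBR pick:", mbr_decode(cands))
    print("QE pick: ", qe_rerank(src, cands, toy_qe))
    pairs = build_finetuning_pairs([src], lambda s: cands, lambda s, c: mbr_decode(c))
    print(pairs)
```

With these toy metrics the two rules select different candidates from the same list, which is the kind of disagreement Figure 1 reports for real metrics. The cost difference is also visible in the sketch: the MBR rule evaluates the utility for every pair of candidates, while the QE rule scores each candidate once.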

Paper Structure (55 sections, 2 equations, 4 figures, 20 tables)

Figures (4)

  • Figure 1: QE vs MBR scores for all 256 candidate translations generated from a single source sentence in the en$\rightarrow$de newstest2009-2019 dataset. The ranking of examples by QE score is not the same as the ranking by MBR score.
  • Figure 2: QE and MBR score distributions for all candidates versus the top-1 candidate (for candidate_size=256), generated from the source side of the en$\rightarrow$de newstest2009-2019 finetuning set.
  • Figure 3: Cross-BLEU of en$\rightarrow$de models on WMT'22 test set, as a measure of model similarity.
  • Figure 4: Finetuned model performance as a function of the number of candidates per source used to generate the QE dataset. Performance of the QE-reranked teacher model improves as the candidate size increases, while the QE-finetuned student is more robust to candidate size, and outperforms the teacher at all candidate sizes. The lower bound of beam search decoding from the base model is also shown for perspective.