Aligning Neural Machine Translation Models: Human Feedback in Training and Inference

Miguel Moura Ramos, Patrick Fernandes, António Farinhas, André F. T. Martins

TL;DR

This work investigates aligning neural machine translation with human preferences by integrating quality metrics as reward models at three points in the MT pipeline: data filtering, reinforcement-learning-based (RL) training, and decoding-time reranking. The authors demonstrate that neural quality estimators, particularly COMET-QE, yield more stable RL training and better translation quality than BLEU-based rewards, and that data filtering can significantly stabilize learning. They show that RL training often outperforms traditional maximum likelihood estimation (MLE) and that combining RL with reranking can yield further gains across multiple evaluation metrics, though with tradeoffs in efficiency. The findings suggest that neural reward models enable more human-aligned MT and point toward potential unsupervised training opportunities using quality estimation signals.

Abstract

Reinforcement learning from human feedback (RLHF) is a recent technique to improve the quality of the text generated by a language model, making it closer to what humans would generate. A core ingredient in RLHF's success in aligning and improving large language models (LLMs) is its reward model, trained using human feedback on model outputs. In machine translation (MT), where metrics trained from human annotations can readily be used as reward models, recent methods using minimum Bayes risk decoding and reranking have succeeded in improving the final quality of translation. In this study, we comprehensively explore and compare techniques for integrating quality metrics as reward models into the MT pipeline. This includes using the reward model for data filtering, during the training phase through RL, and at inference time by employing reranking techniques, and we assess the effects of combining these in a unified approach. Our experimental results, conducted across multiple translation tasks, underscore the crucial role of effective data filtering, based on estimated quality, in harnessing the full potential of RL in enhancing MT quality. Furthermore, our findings demonstrate the effectiveness of combining RL training with reranking techniques, showcasing substantial improvements in translation quality.
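
Two of the three integration points named in the abstract, data filtering and inference-time reranking, can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_reward`, `filter_corpus`, and `rerank` are hypothetical names, and the length-ratio reward is only a stand-in for a learned quality estimator such as COMET-QE, which would score a source and hypothesis pair without a reference.

```python
from typing import Callable, List, Tuple

# Assumed interface: a reward model maps (source, hypothesis) to a score.
# In practice this would wrap a learned QE metric; here it is a placeholder.
RewardFn = Callable[[str, str], float]


def toy_reward(source: str, hypothesis: str) -> float:
    """Placeholder reward based on length ratio. NOT a real quality metric."""
    src_len = len(source.split())
    hyp_len = len(hypothesis.split())
    if src_len == 0 or hyp_len == 0:
        return 0.0
    return min(src_len, hyp_len) / max(src_len, hyp_len)


def filter_corpus(
    pairs: List[Tuple[str, str]], reward: RewardFn, keep_top_k: int
) -> List[Tuple[str, str]]:
    """Data filtering: keep the keep_top_k training pairs with the highest reward."""
    ranked = sorted(pairs, key=lambda p: reward(p[0], p[1]), reverse=True)
    return ranked[:keep_top_k]


def rerank(source: str, candidates: List[str], reward: RewardFn) -> str:
    """Reranking: return the candidate translation the reward model scores highest."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda hyp: reward(source, hyp))


if __name__ == "__main__":
    corpus = [("a b c", "x y z"), ("a b c d", "x"), ("a b", "x y z")]
    print(filter_corpus(corpus, toy_reward, keep_top_k=2))
    print(rerank("a b c", ["x", "x y z", "x y"], toy_reward))
```

Passing the reward as a function keeps the same scorer usable at both points, which mirrors the idea of one preference model serving several roles in the pipeline. The RL training stage is omitted here because it depends on the model and optimizer details given in the paper itself.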

Paper Structure (20 sections, 4 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Preference models can have multifaceted roles within the MT pipeline. They can serve as effective data filters, refining datasets by incorporating user preferences. They can also assume a pivotal role in classic RL training by providing rewards to optimize the MT model performance. Finally, they can act as rerankers during the decoding phase, selecting the final translation by maximizing their scores derived from user preferences.
  • Figure 2: These models were fine-tuned by progressively increasing the size of the high-quality subset, obtained with COMET-QE sentence reranking and denoted in increments of 100,000.
  • Figure 3: Number of hallucinations on the WMT16 EN→DE test set with 3,000 sentences.
  • Figure 4: Comparison of BLEU (top) and BLEURT (bottom) scores for WMT16 EN→DE translations across diverse source sentence lengths, highlighting the influence of sentence length on translation quality.
  • Figure 5: Histograms of sentence BLEU scores for the specific systems on WMT16 EN→DE.
  • ...and 2 more figures