Post-edits Are Preferences Too

Nathaniel Berger, Miriam Exel, Matthias Huck, Stefan Riezler

TL;DR

This work shows that post-edits, when treated as implicit preferences, can be leveraged within Preference Optimization to guide LLMs toward post-edit–like translations rather than raw MT outputs. By pre-training with supervised fine-tuning on post-edits and then applying deterministic PO (dCPO) or IPO, the authors achieve larger margins between post-edits and machine translations and obtain significant neural-criteria gains on WMT APE data for En→De and En→Ru. The study confirms that PO can be more effective when initialized from SFT, addressing reliability concerns of pairwise MT preferences by leveraging post-edit data. The findings suggest a practical pathway to utilize discarded post-edits to improve translation quality without sacrificing the benefits of traditional reference-like outputs. These insights have implications for training regimes in MT and other generation tasks where post-edit signals are available but not directly usable as references.

Abstract

Preference Optimization (PO) techniques are currently among the state-of-the-art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment, $s_1 > s_2$; while for post-editing, editors create $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.
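The idea of reading a post-edit as an implicit preference can be illustrated with a minimal sketch. It assumes the data arrives as (source, machine translation, post-edit) triples; the names `PreferencePair` and `pairs_from_post_edits` are hypothetical, and dropping unedited segments is an assumption of this sketch, not something the abstract states.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    """One pairwise preference: `chosen` is preferred over `rejected`."""
    source: str
    chosen: str    # the post-edit, s_1
    rejected: str  # the machine translation, s_2


def pairs_from_post_edits(
    triples: Iterable[Tuple[str, str, str]],
) -> List[PreferencePair]:
    """Turn (source, mt, post_edit) triples into preference pairs.

    The post-edit is treated as preferred by construction, since the
    editor produced it as an improvement over the machine translation.
    """
    pairs = []
    for source, mt, post_edit in triples:
        # Assumption: an unedited segment carries no preference signal,
        # because chosen and rejected would be identical.
        if post_edit.strip() == mt.strip():
            continue
        pairs.append(PreferencePair(source, chosen=post_edit, rejected=mt))
    return pairs
```

Pairs in this shape (a prompt plus a chosen and a rejected completion) are what pairwise PO objectives typically consume, so no separate annotation step is needed to obtain the preference label.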

Paper Structure

This paper contains 12 sections, 7 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: The generative process for preference optimization is that two sequences $s_1$ and $s_2$ are given, and a preference judgment $s_1 > s_2$ is generated (upper graph). The data generating process of post-editing yields reliable preferences by construction: Given $s_2$ and the implicit preference that $s_1 > s_2$, create $s_1$ (lower graph). We propose using the implicit preferences from post-editing for preference optimization.
  • Figure 2: The difference of the models' averaged sequence log-probabilities from the baseline model's on the WMT 2020 En$\rightarrow$De test data. Zero for PE is an average log-probability of $-0.516$ while for MT it is $-0.565$. This violin plot then shows displacement from these baseline values. Dashed horizontal lines indicate quartiles.
  • Figure 3: The difference of the models' averaged sequence log-probabilities from the baseline model's on the WMT 2019 En$\rightarrow$Ru test data. Zero for PE is an average log-probability of $-1.099$ while for MT it is $-1.260$. This violin plot then shows displacement from these baseline values. Dashed horizontal lines indicate quartiles.
  • Figure 4: Here we show the percentage of training examples where the post-edit sequence is preferred in terms of average log-probability over the machine translation for the WMT En$\rightarrow$De dataset. The black lines indicate the 95% confidence intervals for binomially distributed data; non-overlapping confidence intervals indicate a significant difference.
  • Figure 5: Here we show the percentage of training examples where the post-edit sequence is preferred in terms of average log-probability over the machine translation for the WMT En$\rightarrow$Ru dataset. The black lines indicate the 95% confidence intervals for binomially distributed data; non-overlapping confidence intervals indicate a significant difference.
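The quantities named in the captions for Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the average log-probability is taken to be the mean of per-token log-probabilities, and the 95% interval uses the normal (Wald) approximation to the binomial, since the captions do not say which interval method was used. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def avg_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized sequence log-probability (mean over tokens)."""
    return sum(token_logprobs) / len(token_logprobs)


def pe_preference_rate(
    pe_scores: Sequence[float],
    mt_scores: Sequence[float],
    z: float = 1.96,  # two-sided 95% under the normal approximation
) -> Tuple[float, float, float]:
    """Fraction of examples where the post-edit scores higher than the MT.

    Returns (rate, lower, upper), where the bounds are a Wald interval
    for a binomial proportion, clipped to [0, 1]. The choice of the Wald
    interval is an assumption of this sketch.
    """
    if len(pe_scores) != len(mt_scores) or not pe_scores:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(pe_scores)
    wins = sum(pe > mt for pe, mt in zip(pe_scores, mt_scores))
    rate = wins / n
    half_width = z * math.sqrt(rate * (1.0 - rate) / n)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)
```

Here each entry of `pe_scores` and `mt_scores` would be the `avg_logprob` a model assigns to the post-edit and to the machine translation of the same source. Comparing the intervals of two models then follows the reading given in the captions: intervals that do not overlap indicate a significant difference.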