APE at Scale and its Implications on MT Evaluation Biases
Markus Freitag, Isaac Caswell, Scott Roy
TL;DR
This paper investigates biases in MT evaluation introduced by translationese and proposes a scalable Automatic Post-Editing (APE) approach, trained exclusively on synthetic round-trip translation (RTT) data, to post-edit NMT outputs. It demonstrates that APE can improve human judgments and yield BLEU gains on several language pairs, though BLEU can rise or fall depending on whether a test-set half originates from translated sources or from original text. The work highlights biases in standard BLEU when evaluating translationese-heavy test sets and advocates reporting results separately for the source-original and target-original halves, along with higher-quality, multi-reference test sets. It also shows that APE is agnostic to the underlying MT system and that a surprisingly small amount of RTT data (around 24 million sentences) can suffice for strong post-editing performance.
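The recommendation to report results separately for each half of a test set can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes each test segment carries a tag for the language it was originally written in (the hypothetical `orig_langs` list), and it takes the metric as a pluggable callable so that any corpus-level BLEU implementation could be passed in. The function name `score_by_origin` and the toy `exact_match_rate` metric are illustrative assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

# A corpus-level metric: takes hypotheses and references, returns one score.
Metric = Callable[[List[str], List[str]], float]


def score_by_origin(
    hypotheses: Sequence[str],
    references: Sequence[str],
    orig_langs: Sequence[str],
    metric: Metric,
) -> Dict[str, float]:
    """Score each original-language subset of a test set separately.

    orig_langs[i] is the (assumed) tag of the language in which
    segment i was originally written.
    """
    if not (len(hypotheses) == len(references) == len(orig_langs)):
        raise ValueError("hypotheses, references and orig_langs must align")

    # Group hypothesis/reference pairs by the segment's original language.
    hyps_by_lang: Dict[str, List[str]] = defaultdict(list)
    refs_by_lang: Dict[str, List[str]] = defaultdict(list)
    for hyp, ref, lang in zip(hypotheses, references, orig_langs):
        hyps_by_lang[lang].append(hyp)
        refs_by_lang[lang].append(ref)

    # Apply the metric to each subset on its own.
    return {
        lang: metric(hyps_by_lang[lang], refs_by_lang[lang])
        for lang in hyps_by_lang
    }


def exact_match_rate(hyps: List[str], refs: List[str]) -> float:
    """Toy stand-in for BLEU: fraction of hypotheses equal to the reference."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(hyps)


scores = score_by_origin(
    hypotheses=["a", "b", "c", "d"],
    references=["a", "x", "c", "d"],
    orig_langs=["en", "en", "de", "de"],
    metric=exact_match_rate,
)
```

For an English to German test set, the `en` entry would correspond to the source-original half and the `de` entry to the target-original half, so a change that helps one half and hurts the other stays visible instead of being averaged away.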
Abstract
In this work, we train an Automatic Post-Editing (APE) model and use it to reveal biases in standard Machine Translation (MT) evaluation procedures. The goal of our APE model is to correct typical errors introduced by the translation process, and convert the "translationese" output into natural text. Our APE model is trained entirely on monolingual data that has been round-trip translated through English, to mimic errors that are similar to the ones introduced by NMT. We apply our model to the output of existing NMT systems, and demonstrate that, while the human-judged quality improves in all cases, BLEU scores drop with forward-translated test sets. We verify these results for the WMT18 English to German, WMT15 English to French, and WMT16 English to Romanian tasks. Furthermore, we selectively apply our APE model on the output of the top submissions of the most recent WMT evaluation campaigns. We see quality improvements on all tasks of up to 2.5 BLEU points.
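The round-trip construction of APE training data described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `to_english` and `from_english` are hypothetical callables standing in for whatever NMT systems perform the two translation directions, and `build_rtt_pairs` is an illustrative name.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical translator interface: one sentence in, one sentence out.
Translator = Callable[[str], str]


def build_rtt_pairs(
    monolingual: Iterable[str],
    to_english: Translator,
    from_english: Translator,
) -> List[Tuple[str, str]]:
    """Build (round-tripped, original) pairs from target-language text.

    The round-tripped sentence serves as the APE model input and the
    original natural sentence as the output it should learn to restore.
    """
    pairs: List[Tuple[str, str]] = []
    for sentence in monolingual:
        original = sentence.strip()
        if not original:
            continue  # skip empty lines
        pivot = to_english(original)      # target language -> English
        round_trip = from_english(pivot)  # English -> target language
        pairs.append((round_trip, original))
    return pairs


# Toy stand-ins for real MT systems, only to show the data flow.
pairs = build_rtt_pairs(
    ["Guten Tag", "   ", "Hallo Welt"],
    to_english=lambda s: s.upper(),
    from_english=lambda s: s.lower(),
)
```

With real translation systems in place of the toy lambdas, the first element of each pair would carry translation-style errors, while the second remains natural text in the target language.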
