MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska, Markus Freitag
TL;DR
MQM re-annotation addresses evaluation noise in machine translation by introducing a two-stage framework in which existing MQM annotations are reviewed and edited by an expert, optionally starting from automatic priors. The study investigates how re-annotation affects rater behavior, the detection of artificial error spans, and the quality of annotations when re-annotating human versus automatic priors. Results show that re-annotation improves annotation quality and increases inter-annotator agreement, and that high-quality automatic priors can boost quality without additional human cost. This approach yields higher-quality ground truth for MT evaluation and can support more reliable benchmarking and meta-evaluation of translation quality.
Abstract
Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations, which may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
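The two-stage setup can be pictured as a comparison between a prior set of error spans and the set left after a second annotator has reviewed and edited it. The following is a minimal sketch under stated assumptions, not the paper's implementation: the `ErrorSpan` fields, the `compare_passes` helper, and the example spans are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ErrorSpan:
    """Hypothetical representation of one MQM error annotation."""
    start: int      # character offset into the translation (inclusive)
    end: int        # character offset into the translation (exclusive)
    category: str   # e.g. "accuracy/mistranslation"
    severity: str   # e.g. "minor" or "major"


def compare_passes(prior, reannotated):
    """Split spans into those kept, removed, and added by the second pass.

    Spans are matched by exact equality, so a span whose severity or
    boundaries were edited shows up as one removal plus one addition.
    """
    prior_set, new_set = set(prior), set(reannotated)
    return {
        "kept": sorted(prior_set & new_set),
        "removed": sorted(prior_set - new_set),
        "added": sorted(new_set - prior_set),
    }


# Illustrative data only: a first-pass annotation and its re-annotation.
prior = [
    ErrorSpan(0, 5, "accuracy/mistranslation", "major"),
    ErrorSpan(10, 14, "fluency/grammar", "minor"),
]
reannotated = [
    ErrorSpan(0, 5, "accuracy/mistranslation", "major"),
    ErrorSpan(20, 27, "accuracy/omission", "major"),
]

diff = compare_passes(prior, reannotated)
for change, spans in diff.items():
    print(change, len(spans))
```

In this framing, the "added" bucket corresponds to errors that were missed during the first pass, which the abstract identifies as the main source of the quality gain. Exact-match comparison is the simplest choice; an overlap-based match would be needed to track spans whose boundaries were only adjusted.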
