MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation

Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska, Markus Freitag

TL;DR

MQM re-annotation addresses evaluation noise in machine translation by introducing a two-stage framework where existing MQM annotations are reviewed and edited by an expert, optionally with automatic priors. The study investigates how re-annotation affects rater behavior, the detection of artificial error spans, and the quality of annotations when re-annotating human versus automatic priors. Results show that re-annotation improves annotation quality and increases inter-annotator agreement, and that high-quality automatic priors can boost quality without additional human cost. This approach yields higher-quality ground truth for MT evaluation and can support more reliable benchmarking and meta-evaluation of translation quality.

Abstract

Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations, which may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.

Paper Structure

This paper contains 18 sections, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: Illustration of MQM re-annotation. A source document and its translation are annotated in the MQM framework by either a human or an automatic system, and then re-annotated by a human. If the initial annotator was a human, the re-annotator can be either the same person or a different one.
  • Figure 2: Anthea interface showcasing prior error annotations.
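
The two-stage flow shown in Figure 1 can be illustrated with a minimal sketch: a first pass yields prior error spans, and the re-annotator then deletes spans they disagree with and adds errors that were missed. Everything named here (`ErrorSpan`, `reannotate`, `mqm_score`, the example spans, and the severity weights) is hypothetical and chosen only for illustration; it is not the authors' implementation or the Anthea data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpan:
    """One MQM-style error annotation on a translation (hypothetical schema)."""
    start: int      # character offset where the span begins (inclusive)
    end: int        # character offset where the span ends (exclusive)
    category: str   # e.g. "accuracy/mistranslation"
    severity: str   # "major" or "minor"


# Illustrative severity weights; an assumption, not taken from the source.
SEVERITY_WEIGHTS = {"major": 5.0, "minor": 1.0}


def reannotate(prior, deletions, additions):
    """Apply a re-annotator's edits to the prior spans.

    Spans listed in `deletions` are dropped, spans in `additions` are
    appended, and the result is ordered by position in the translation.
    """
    kept = [span for span in prior if span not in deletions]
    return sorted(kept + list(additions), key=lambda s: (s.start, s.end))


def mqm_score(spans):
    """Sum the severity weights of all spans (higher means more errors)."""
    return sum(SEVERITY_WEIGHTS[s.severity] for s in spans)


# First pass: annotations from a human or an automatic system.
span_a = ErrorSpan(0, 5, "accuracy/mistranslation", "major")
span_b = ErrorSpan(10, 14, "fluency/grammar", "minor")
prior = [span_a, span_b]

# Second pass: the re-annotator rejects span_b and adds a missed error.
span_c = ErrorSpan(20, 27, "accuracy/omission", "major")
final = reannotate(prior, deletions=[span_b], additions=[span_c])

print(mqm_score(prior), mqm_score(final))
```

In this toy case the re-annotated score is higher than the first-pass score because the added span outweighs the deleted one, which mirrors the abstract's observation that most of the gain comes from errors missed in the first pass.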