AI-Assisted Human Evaluation of Machine Translation

Vilém Zouhar, Tom Kocmi, Mrinmaya Sachan

TL;DR

This work tackles the high cost of MT human evaluation by introducing ESA$^\mathrm{AI}$, a two-stage human-AI collaboration where a recall-focused QE (GEMBA) pre-fills error spans to prime annotators before final scoring. By comparing ESA$^\mathrm{AI}$ to the traditional ESA protocol on WMT-like English→German data, the study shows ESA$^\mathrm{AI}$ yields more error spans, faster per-span editing, higher annotator agreement, and potential cost savings of up to ~25% through prefiltering. It also introduces subset-consistency analysis to quantify how many annotations are needed to recover the correct system ranking, demonstrating improved robustness and efficiency for MT system evaluation. The findings highlight practical implications for scalable, reliable MT evaluation with reduced human effort, while acknowledging biases linked to shared underlying LLMs and outlining mitigation strategies.

Abstract

Annually, research teams spend large amounts of money to evaluate the quality of machine translation systems (WMT, inter alia). This is expensive because it requires a lot of expert human labor. In the recently adopted annotation protocol, Error Span Annotation (ESA), annotators mark erroneous parts of the translation and then assign a final score. A lot of the annotator time is spent on scanning the translation for possible errors. In our work, we help the annotators by pre-filling the error annotations with recall-oriented automatic quality estimation. With this AI assistance, we obtain annotations at the same quality level while cutting down the time per span annotation by half (71s/error span $\rightarrow$ 31s/error span). The biggest advantage of the ESA$^\mathrm{AI}$ protocol is an accurate priming of annotators (pre-filled error spans) before they assign the final score. This alleviates a potential automation bias, which we confirm to be low. In our experiments, we find that the annotation budget can be further reduced by almost 25% with filtering of examples that the AI deems to be likely to be correct.

Paper Structure (29 sections, 6 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: The pipeline (top) and annotation user interface (bottom) with Error Span Annotation pre-filled with AI. In the example, the user: (1) lowered the severity of the gender agreement error, (2) removed an incorrectly marked error span, and (3) assigned the final score.
  • Figure 2: Overview of inputs and outputs of various MT evaluation approaches. Quality estimation (QE) is automated and produces, for each segment, either a single score or a list of errors. DA+SQM, MQM, ESA, and ESA$^\mathrm{AI}$ are human annotation protocols. ESA$^\mathrm{AI}$ (this paper) is semi-automated and happens in two steps: quality estimation pre-annotation and human annotation.
  • Figure 3: Number of removed/kept/added error spans from the QE system with respect to annotator progress. The amount and type of work remain constant.
  • Figure 4: Annotation actions (remove/keep/add an error span) and time per segment. Each dot and bar is an annotator (sorted by time).
  • Figure 5: Time per segment with respect to progression in the annotation. Each annotator is a faint gray line and their average is in black. The lines are smoothed with a window of 15 segments. We also compute the average speed at the beginning and at the end, which yields the learned speedup, that is, how much faster the annotator becomes with each segment they work on.
  • ...and 7 more figures
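
The smoothing and learned speedup described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the trailing-window smoothing, the `edge` size used for the "beginning" and "end" averages, and the normalization by the distance between those two windows are all assumptions made here. Only the window of 15 segments comes from the caption.

```python
from statistics import mean


def moving_average(times, window=15):
    """Smooth per-segment times with a trailing window.

    The window is shorter for the first `window - 1` segments
    (assumption: the source does not say how edges are handled).
    """
    smoothed = []
    for i in range(len(times)):
        chunk = times[max(0, i - window + 1): i + 1]
        smoothed.append(mean(chunk))
    return smoothed


def learned_speedup(times, edge=15):
    """Estimate seconds saved per segment of annotation experience.

    Assumed definition: (mean time over the first `edge` segments
    minus mean time over the last `edge` segments), divided by the
    number of segments between the centers of those two windows.
    """
    if len(times) < 2 * edge:
        raise ValueError("need at least 2 * edge segments")
    start = mean(times[:edge])
    end = mean(times[-edge:])
    span = len(times) - edge  # distance between the two window centers
    return (start - end) / span


if __name__ == "__main__":
    # Hypothetical annotator who gets 0.2 s faster with every segment.
    times = [60 - 0.2 * i for i in range(100)]
    print(moving_average(times)[:3])
    print(learned_speedup(times))
```

A positive value means the annotator is getting faster over the course of the task. Applying `learned_speedup` per annotator and averaging the results would give a group-level figure, under the same assumptions.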