Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation

Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis, Roman Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan, Mariya Shmatova

TL;DR

This paper introduces Error Span Annotation (ESA), a human evaluation protocol that combines the continuous rating of Direct Assessment (DA) with the high-level error severity span marking of Multidimensional Quality Metrics (MQM). It shows that ESA offers faster and cheaper annotations than MQM at the same quality level, without requiring expensive MQM experts.

Abstract

High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.

Paper Structure

This paper contains 30 sections, 1 equation, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Stylized annotation user interface with Error Span Annotation (ESA). The annotator first marks errors with minor and major severity and then assigns a final score. This is more robust than asking for a score directly.
  • Figure 2: Screenshot of the beginning of one annotated document in the ESA interface (following segments are not shown). By showing and annotating whole documents at the segment level, the annotators see all the relevant context. The segment reset button and completed labels are removed for brevity. See the interactive tutorial shown to all annotators in the Appendix.
  • Figure 3: Distribution of scores for one annotation campaign. For ESA, we use either the manual score or the ESAspans computation based on error severities. For MQM, the distribution is clipped to $\geq -15$ for higher resolution.
  • Figure 4: Each point represents a system, with the original MQMWMT scores on the y-axis plotted against our rerun of DA+SQMWMT (first plot), ESA (second plot), ESAspans (third plot), and MQM (fourth plot). Dashed lines indicate cluster separations determined by each method with a significance threshold of $\alpha = 0.05$. We compute Spearman correlation $\rho$ and pairwise accuracy $\textsc{Acc}$.
  • Figure 5: Time per segment with respect to progression in the annotation. The faint gray lines represent individual annotators, while the bold black line shows the average time. The lines are smoothed with a window of size 15 segments. We also compute the average speed at the beginning and at the end, which yields the learned speedup. This is how much the annotator speeds up after working on one segment.
  • ...and 6 more figures
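
The Figure 4 caption reports two system-level agreement measures, Spearman correlation $\rho$ and pairwise accuracy $\textsc{Acc}$. The sketch below illustrates how such measures are commonly computed from two lists of per-system scores. It is not the authors' code: the function names, the tie handling (tied pairs count as disagreements), and the example scores are all assumptions made for illustration.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def pairwise_accuracy(xs, ys):
    """Fraction of system pairs that both score lists order the same way.

    Assumption: a pair tied under either list counts as a disagreement.
    """
    pairs = list(combinations(range(len(xs)), 2))
    agree = sum(1 for i, j in pairs if (xs[i] - xs[j]) * (ys[i] - ys[j]) > 0)
    return agree / len(pairs)


# Hypothetical per-system scores (not taken from the paper).
reference_scores = [-2.1, -3.5, -4.0, -6.2]   # e.g. an MQM-style score
candidate_scores = [80.0, 71.0, 74.0, 60.0]   # e.g. an ESA-style score

print(spearman(reference_scores, candidate_scores))
print(pairwise_accuracy(reference_scores, candidate_scores))
```

In this toy example the two protocols disagree on the order of only one system pair (the second and third systems), so five of the six pairs agree. Both measures assume that higher scores mean better translations in both lists; a penalty-style score would need its sign flipped first.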