Enhancing Human Evaluation in Machine Translation with Comparative Judgment

Yixiao Song, Parker Riley, Daniel Deutsch, Markus Freitag

TL;DR

The study examines how annotation design shapes human MT evaluation and compares three settings—MQM, side-by-side MQM, and side-by-side relative ranking—across ZhEn and EnDe. By leveraging comparative judgment, it shows that s×s MQM improves inter-annotator agreement and inter-translation consistency, while s×s RR provides a scalable alternative for system ranking. The results indicate stable system rankings across settings, with trade-offs in detecting subtle differences versus efficiency. The authors also release triply annotated ZhEn and EnDe datasets to spur further research in MT evaluation methodologies.

Abstract

Human evaluation is crucial for assessing rapidly evolving language models but is influenced by annotator proficiency and task design. This study explores the integration of comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups: point-wise Multidimensional Quality Metrics (MQM), side-by-side (SxS) MQM, and its simplified version, SxS relative ranking (RR). In MQM, annotators mark error spans with categories and severity levels. SxS MQM extends MQM to pairwise error annotation for two translations of the same input, while SxS RR focuses on selecting the better output without labeling errors. Key findings are: (1) the SxS settings achieve higher inter-annotator agreement than MQM; (2) SxS MQM enhances inter-translation error marking consistency compared to MQM by, on average, 38.5% for explicitly compared MT systems and 19.5% for others; (3) all annotation settings return stable system rankings, with SxS RR offering a more efficient alternative to (SxS) MQM; (4) the SxS settings highlight subtle errors overlooked in MQM without altering absolute system evaluations. To spur further research, we will release the triply annotated datasets comprising 377 ZhEn and 104 EnDe annotation examples.

Paper Structure

This paper contains 29 sections, 1 equation, 5 figures, 16 tables.

Figures (5)

  • Figure 1: Illustration of the three annotation settings studied in this work. The grey-highlighted text is the segment to be annotated, shown within its context. In single-sided and side-by-side MQM, annotators mark error spans and assign each an error category and a severity. The score of a segment/document is determined by the category and severity of its error(s). In side-by-side relative ranking, annotators read two translations and choose the (much) better side or decide that they tie, without labelling errors. The scoring scheme of each setting is given in the paper's section on score calculation.
  • Figure 2: Percentages of error categories in the MQM settings in ZhEn and EnDe. The GPT4-5shot errors in EnDe MQM are doubled for a fair comparison with EnDe s$\times$s MQM. While the percentages in EnDe stay relatively stable, in ZhEn, accuracy errors have a higher percentage in s$\times$s MQM than in MQM.
  • Figure 3: Number of errors in five categories in the MQM settings in ZhEn and EnDe across all three rounds of annotation. The "Others" category includes non-translation, locale convention, and other.
  • Figure 4: Violin plots of the original segment scores contributed by each annotator (without z-normalization). Annotator identities are omitted for anonymity. The dots indicate the mean, while the crosses represent the median of each distribution.
  • Figure 5: Error category conversion from MQM to s$\times$s MQM in (a) ZhEn and (b) EnDe of the same errors annotated by the annotators in both MQM and s$\times$s MQM. MQM EnDe GPT4-5shot is duplicated for the comparison.
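
The captions for Figures 1 and 4 mention two computations: a segment score derived from the category and severity of its errors, and z-normalization of each annotator's scores. The sketch below illustrates both in Python under stated assumptions. The severity weights, the decision to ignore the error category, and the function names `segment_score` and `z_normalize` are hypothetical placeholders chosen for illustration; the paper's actual scoring scheme is defined in its own score-calculation section and may differ.

```python
from statistics import mean, pstdev

# Assumed weights, for illustration only (not taken from the paper).
SEVERITY_WEIGHTS = {"major": 5.0, "minor": 1.0}


def segment_score(errors):
    """Sum the severity weights of a segment's errors.

    Each error is a (category, severity) pair. This sketch ignores the
    category; a real scheme may weight some categories differently.
    """
    return sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)


def z_normalize(scores):
    """Standardize one annotator's raw scores to mean 0 and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    if sigma == 0:
        # All scores identical: no spread to normalize.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]
```

Normalizing per annotator, as in this sketch, removes differences in how strictly individual annotators score, so that segment scores from different annotators can be compared on a common scale.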