To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation

Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, Arul Menezes

TL;DR

The paper examines how reliable automatic MT evaluation metrics are for pairwise system ranking, and shows that traditional metrics like BLEU can mislead development. It collects a large-scale corpus of human judgements (2.3M judgements across 4,380 systems and 232 directions) and systematically compares string-based and pretrained metrics using a pairwise accuracy framework. Pretrained metrics (notably COMET and COMET-src) outperform string-based ones, with ChrF remaining the best string-based option. The authors demonstrate that statistical significance testing makes decisions more reliable, and they provide practical best-practice guidelines, including using COMET as the primary metric and publishing translated outputs for replication. The work delivers a dataset and an evaluation framework that can steer MT metric development and evaluation toward judgements that align more closely with humans across languages and domains.

Abstract

Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metrics correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on -- to the best of our knowledge -- the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.
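
The pairwise ranking task described in the abstract can be illustrated with a minimal sketch. It assumes that a metric "agrees" with humans on a system pair when the difference in metric scores has the same sign as the difference in average human judgements, and that accuracy is the share of agreeing pairs. The function names and the toy numbers are hypothetical and are not taken from the paper's released data.

```python
from typing import Sequence


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_accuracy(
    metric_deltas: Sequence[float], human_deltas: Sequence[float]
) -> float:
    """Fraction of system pairs where the metric ranks the pair like humans do.

    Each entry is score(system A) - score(system B) for one system pair.
    """
    if len(metric_deltas) != len(human_deltas):
        raise ValueError("both sequences must describe the same system pairs")
    if not metric_deltas:
        raise ValueError("at least one system pair is required")
    agreeing = sum(
        1 for m, h in zip(metric_deltas, human_deltas) if sign(m) == sign(h)
    )
    return agreeing / len(metric_deltas)


# Toy example (made-up numbers): four system pairs.
metric_deltas = [1.2, -0.4, 0.3, 0.8]
human_deltas = [2.0, 0.5, 1.1, -0.7]
accuracy = pairwise_accuracy(metric_deltas, human_deltas)
print(accuracy)
```

In this toy example the metric agrees with the human ranking on two of the four pairs. Whether ties (a delta of exactly zero) should count as agreement is a design choice; this sketch counts a pair as agreeing only when both signs match.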

Paper Structure

This paper contains 20 sections, 2 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Each point represents a difference in average human judgement (y-axis) and a difference in automatic metric (x-axis) over a pair of systems. Blue points are system pairs translating from English; green points into English; red points are non-English system pairs (a few French, German, or Chinese-centric system pairs). We report Spearman's correlation in the top left corner and Pearson's r in the bottom right corner. Metrics disagree with human ranking for system pairs in pink quadrants. Other metrics are shown in the Appendix.
  • Figure 2: Each point represents a difference in average human judgement (y-axis) and a difference in automatic metric (x-axis) over a pair of systems. Blue points are system pairs translating from English; green points are into English; red points are non-English system pairs (French, German, and Chinese-centric). Spearman's correlation is in the top left corner, while Pearson's r is in the bottom right corner. Metrics disagree with human ranking for system pairs in pink quadrants. For better visualization, we have clipped a few outliers in the BLEU, ChrF, and TER plots.
  • Figure 3: Each row shows the accuracy over system pairs for a given language pair. We list language pairs with at least 20 system pairs. Results are calculated over the set of significantly different system pairs at an alpha level of 0.05. Results with a grey background are considered to be tied with the best metric. Interestingly, when we investigated the Polish--English results, we found that the test set is likely post-edited MT output.
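
The quantities reported in the scatter plots of Figures 1 and 2 can be sketched as follows. This is an illustrative sketch using only the standard library, not the authors' code: it computes Pearson's r and Spearman's correlation over per-pair deltas, and counts the points that would fall in a disagreement quadrant (metric delta and human delta with opposite signs). All function names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's r between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's correlation: Pearson's r computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def disagreement_count(
    metric_deltas: Sequence[float], human_deltas: Sequence[float]
) -> int:
    """Number of system pairs whose metric and human deltas have opposite signs."""
    return sum(1 for m, h in zip(metric_deltas, human_deltas) if m * h < 0)


# Toy deltas (made-up): monotonic but non-linear relationship.
metric_deltas = [1.0, 2.0, 3.0, 4.0]
human_deltas = [1.0, 4.0, 9.0, 16.0]
print(spearman(metric_deltas, human_deltas), pearson(metric_deltas, human_deltas))
```

In the toy data the relationship is monotonic but not linear, so the rank-based Spearman correlation is perfect while Pearson's r is slightly lower, which is one reason the two values can differ in such plots.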