To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation

Tom Kocmi; Christian Federmann; Roman Grundkiewicz; Marcin Junczys-Dowmunt; Hitokazu Matsushita; Arul Menezes

TL;DR

The paper examines how reliably automatic MT evaluation metrics rank pairs of systems and shows that relying on traditional metrics such as BLEU can mislead development. The authors collect a large corpus of human judgments (2.3M judgments across 4,380 systems and 232 translation directions) and compare string-based and pretrained metrics by pairwise accuracy, that is, how often a metric agrees with human judgment on which of two systems is better. Pretrained metrics, notably COMET and COMET-src, outperform string-based ones, and ChrF is the best string-based option. The authors also show that statistical significance testing makes decisions more reliable, and they give best-practice guidelines, including using COMET as the primary metric and publishing translated outputs for replication. The released dataset and evaluation framework can guide MT metric development toward closer agreement with human judgments across languages and domains.

Abstract

Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metric correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on, to the best of our knowledge, the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models, leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.
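
The pairwise accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes that accuracy is the fraction of system pairs for which the metric's score difference has the same sign as the human score difference, and that all systems are scored on the same test set. The function name, the system names, and the scores are hypothetical and chosen only for illustration. Counting a tie in both scores as agreement is also an assumption of this sketch.

```python
from itertools import combinations


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_accuracy(human: dict, metric: dict) -> float:
    """Fraction of system pairs ranked the same way by metric and humans.

    `human` and `metric` map a system name to its system-level score.
    A pair counts as agreement when both score differences share a sign.
    """
    pairs = list(combinations(sorted(human), 2))
    agree = sum(
        sign(metric[a] - metric[b]) == sign(human[a] - human[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical scores, not taken from the paper.
human = {"sysA": 72.1, "sysB": 70.4, "sysC": 68.9}
metric = {"sysA": 31.0, "sysB": 32.5, "sysC": 29.8}

accuracy = pairwise_accuracy(human, metric)
print(f"pairwise accuracy: {accuracy:.3f}")
```

In this toy example the metric disagrees with the human ranking only on the pair (sysA, sysB), so it agrees on two of the three pairs.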
