Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In!
Stefano Perrella, Lorenzo Proietti, Alessandro Scirè, Edoardo Barba, Roberto Navigli
TL;DR
This paper reveals that the standard WMT MT metric meta-evaluation can be biased toward trained and continuous metrics due to grouping and tie-handling practices. By introducing sentinel metrics that purposely rely on partial information, the authors demonstrate how spurious correlations can inflate certain metric rankings, especially under No Grouping or System Grouping. They show that Segment Grouping mitigates these biases and that tie calibration can unduly favor continuous metrics, with held-out calibration not solving the fairness issue. The work also provides two new sentinels derived from GEMBA-MQM and MaTESe to illustrate the impact of continuity on ranking, and highlights strong correlations between sentinels and state-of-the-art metrics, raising concerns about reliance on training data correlations. Overall, the study advocates for fairer meta-evaluation practices and calls for deeper analyses of continuous versus discrete metric behaviors to improve MT evaluation reliability and interpretability.
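As a rough illustration of the grouping strategies named above, the following sketch computes a correlation with human scores under No Grouping, Segment Grouping, and System Grouping for a toy sentinel whose score depends only on the source segment. Everything here is an assumption made for illustration: the function names, the toy scores, the use of Pearson correlation, and the choice to count an undefined correlation (constant input) as 0.0. It is not the WMT implementation or the authors' code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists.

    Assumption for this sketch: if either list is constant, the correlation
    is undefined, and we report it as 0.0.
    """
    if len(set(xs)) < 2 or len(set(ys)) < 2:
        return 0.0
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Score matrices are indexed as scores[system][segment].
def no_grouping(metric, human):
    """Flatten all (system, segment) scores and correlate once."""
    flat_m = [v for row in metric for v in row]
    flat_h = [v for row in human for v in row]
    return pearson(flat_m, flat_h)


def segment_grouping(metric, human):
    """Correlate across systems within each segment, then average."""
    n_seg = len(metric[0])
    cors = [
        pearson([row[j] for row in metric], [row[j] for row in human])
        for j in range(n_seg)
    ]
    return sum(cors) / n_seg


def system_grouping(metric, human):
    """Correlate across segments within each system, then average."""
    cors = [pearson(m, h) for m, h in zip(metric, human)]
    return sum(cors) / len(cors)


# Toy human scores: 3 systems x 4 segments (made-up numbers).
human = [
    [0.9, 0.5, 0.2, 0.7],
    [0.8, 0.6, 0.1, 0.6],
    [0.7, 0.4, 0.3, 0.8],
]

# Hypothetical source-only sentinel: one score per segment, identical for
# every system, so it cannot tell translations of the same source apart.
per_segment_score = [0.8, 0.5, 0.2, 0.7]
sentinel = [list(per_segment_score) for _ in human]

if __name__ == "__main__":
    print("No Grouping:     ", no_grouping(sentinel, human))
    print("Segment Grouping:", segment_grouping(sentinel, human))
    print("System Grouping: ", system_grouping(sentinel, human))
```

In this toy setup the sentinel never looks at a translation, yet it correlates strongly with human scores under No Grouping and System Grouping, because some segments receive higher human scores than others regardless of the system. Under Segment Grouping every group holds a single source segment, so the sentinel is constant within each group and its score is 0.0.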
Abstract
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics, ranking them according to their correlation with human judgments. Their results guide researchers toward enhancing the next generation of metrics and MT systems. With the recent introduction of neural metrics, the field has witnessed notable advancements. Nevertheless, the inherent opacity of these metrics has posed substantial challenges to the meta-evaluation process. This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metric rankings. To do this, we introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness. By employing sentinel metrics, we aim to validate our findings, and to shed light on and monitor the potential biases or inconsistencies in the rankings. We discover that the present meta-evaluation framework favors two categories of metrics: i) those explicitly trained to mimic human quality assessments, and ii) continuous metrics. Finally, we raise concerns regarding the evaluation capabilities of state-of-the-art metrics, emphasizing that they might be basing their assessments on spurious correlations found in their training data.
