Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In!

Stefano Perrella, Lorenzo Proietti, Alessandro Scirè, Edoardo Barba, Roberto Navigli

TL;DR

This paper reveals that the standard WMT MT metric meta-evaluation can be biased toward trained and continuous metrics due to grouping and tie-handling practices. By introducing sentinel metrics that purposely rely on partial information, the authors demonstrate how spurious correlations can inflate certain metric rankings, especially under No Grouping or System Grouping. They show that Segment Grouping mitigates these biases and that tie calibration can unduly favor continuous metrics, with held-out calibration not solving the fairness issue. The work also provides two new sentinels derived from GEMBA-MQM and MaTESe to illustrate the impact of continuity on ranking, and highlights strong correlations between sentinels and state-of-the-art metrics, raising concerns about reliance on training data correlations. Overall, the study advocates for fairer meta-evaluation practices and calls for deeper analyses of continuous versus discrete metric behaviors to improve MT evaluation reliability and interpretability.
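The difference between the grouping strategies can be made concrete with a toy example. The sketch below is not the authors' code: the function names, the score matrices, and the choice to return NaN for undefined correlations are all assumptions made for illustration. It takes the usual WMT reading of the three strategies (No Grouping pools every system-segment pair, Segment Grouping correlates across systems within each source segment, System Grouping correlates across segments within each system) and applies them to a hypothetical sentinel that sees only the source segment, so its score is identical for every translation of that segment.

```python
import math
import numpy as np

def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = math.sqrt(float((xc ** 2).sum() * (yc ** 2).sum()))
    return float((xc * yc).sum() / denom) if denom > 0 else float("nan")

def _mean_defined(values):
    """Average the defined correlations; NaN if none is defined (an assumption)."""
    defined = [v for v in values if not math.isnan(v)]
    return sum(defined) / len(defined) if defined else float("nan")

# Score matrices have shape (n_systems, n_segments).
def no_grouping(metric, human):
    # Pool all system-segment pairs into one correlation.
    return pearson(metric.ravel(), human.ravel())

def segment_grouping(metric, human):
    # One correlation per source segment (column), across systems.
    return _mean_defined(
        [pearson(metric[:, j], human[:, j]) for j in range(metric.shape[1])]
    )

def system_grouping(metric, human):
    # One correlation per system (row), across segments.
    return _mean_defined(
        [pearson(metric[i, :], human[i, :]) for i in range(metric.shape[0])]
    )

# Toy data (hypothetical): each segment has an intrinsic difficulty shared by
# all systems, plus a small system-specific offset.
difficulty = np.array([-5.0, -1.0, -3.0, 0.0])
offsets = np.array([
    [0.0, -1.0, 0.5, -0.5],
    [-1.0, 0.0, -0.5, 0.0],
    [-0.5, -0.5, 0.0, -1.0],
])
human = difficulty[None, :] + offsets

# A source-only "sentinel": the same score for every translation of a segment.
sentinel = np.tile(difficulty, (offsets.shape[0], 1))

print("No Grouping:     ", no_grouping(sentinel, human))
print("System Grouping: ", system_grouping(sentinel, human))
print("Segment Grouping:", segment_grouping(sentinel, human))
```

In this toy setting the sentinel correlates strongly with the human scores under No Grouping and System Grouping, although it never looks at a translation. Under Segment Grouping its scores are constant within every group, so no correlation can be computed and it earns no credit, which is the mitigation the summary describes.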

Abstract

Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics, ranking them according to their correlation with human judgments. Their results guide researchers toward enhancing the next generation of metrics and MT systems. With the recent introduction of neural metrics, the field has witnessed notable advancements. Nevertheless, the inherent opacity of these metrics has posed substantial challenges to the meta-evaluation process. This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings. To do this, we introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness. By employing sentinel metrics, we aim to validate our findings, and shed light on and monitor the potential biases or inconsistencies in the rankings. We discover that the present meta-evaluation framework favors two categories of metrics: i) those explicitly trained to mimic human quality assessments, and ii) continuous metrics. Finally, we raise concerns regarding the evaluation capabilities of state-of-the-art metrics, emphasizing that they might be basing their assessments on spurious correlations found in their training data.

Paper Structure (23 sections, 3 equations, 12 figures, 14 tables)


Figures (12)

  • Figure 1: We show XCOMET-Ensemble assessments and MQM-based human judgments in the top and bottom figures, respectively, over the length of the candidate translation (in characters). The red line represents the linear least-squares regression. MQM human judgments smaller than $-25$ have been removed for improved clarity. The language pair is $\textsc{zh}\rightarrow\textsc{en}$.
  • Figure 2: $\text{acc}_{\text{eq}}$ (left) and optimal $\epsilon$ (right) of the considered metrics for varying percentages of human ties in the test dataset, where $0.24$ is the percentage of human ties in the entire dataset, obtained when $p_t$ and $p_n$ are both $0$. $\epsilon$ values have been scaled using min-max scaling. Specifically, for each metric, the minimum $\epsilon$ is the optimal $\epsilon$ at $0\%$ of human ties, and the maximum is the optimal $\epsilon$ at $100\%$. The language direction is $\textsc{zh}\rightarrow\textsc{en}$. Results concerning all language directions can be found in Appendix \ref{apx:ties}. For each percentage of human ties, we use $5$ different seeds to sub-sample the test data. Therefore, the shown $\text{acc}_{\text{eq}}$ and $\epsilon$, for each metric and percentage of ties, are averaged across $5$ different runs.
  • Figure 3: $\text{acc}_{\text{eq}}$ of the considered metrics when tie calibration is conducted on a held-out set, derived as a $20\%$ split of the test set, and repeatedly sub-sampled to modify its percentage of tied human scores. The x-axis represents the percentage of ties in the held-out set, while the y-axis represents the $\text{acc}_{\text{eq}}$, as computed on the remaining $80\%$ of the test set. The language direction is $\textsc{zh}\rightarrow\textsc{en}$, and results concerning all language directions can be found in Appendix \ref{apx:ties}. The percentage of human ties in the $80\%$ split of the test set is $24\%$.
  • Figure 4: Pairwise correlation between a part of the primary submissions and baselines of WMT23, and sentinel metrics. Correlation is Pearson with No Grouping, and the language direction is $\textsc{zh}\rightarrow\textsc{en}$.
  • Figure 5: Pairwise correlation between a part of the primary submissions and baselines of WMT23, and sentinel metrics. Correlation is Pearson with No Grouping, and the language direction is $\textsc{en}\rightarrow\textsc{de}$.
  • ...and 7 more figures