A Measure of the System Dependence of Automated Metrics

Pius von Däniken, Jan Deriu, Mark Cieliebak

TL;DR

The paper addresses a gap in the evaluation of automated MT metrics: a metric’s relationship to human judgments can be system-dependent, so the metric may misrank systems even when segment-level correlations are high. It introduces the SysDep framework, defined via a global mapping $f_G$ and system-specific mappings $f_k$, and quantifies how far a metric over- or under-rates each system through the Expected Deviation $ED(k)$ and the overall metric-system dependence $SysDep$. Using Isotonic Regression and bootstrap ensembles on WMT23 zh-en data, the authors demonstrate substantial SysDep for leading metrics such as XCOMET, showing that global monotonicity does not guarantee fair system rankings. The work provides a principled measure for assessing metric fairness, highlights ranking instabilities, and sets the stage for future metric design that minimizes system dependence. Its noted limitations are the restriction to the WMT23 dataset and the narrow domain scope.
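The idea of comparing a global mapping with per-system mappings can be illustrated with a small sketch. This summary does not give the exact formulas, so the code assumes that $f_G$ and $f_k$ map metric scores to expected human ratings, that $ED(k)$ is the mean of $f_k(m) - f_G(m)$ over system $k$'s metric scores, and that $SysDep$ is the spread $\max_k ED(k) - \min_k ED(k)$. The helper names `fit_isotonic` and `system_dependence` are hypothetical, the isotonic fit is a plain pool-adjacent-violators step function rather than a library implementation, and the bootstrap ensembles are omitted.

```python
from bisect import bisect_left
from statistics import mean


def fit_isotonic(xs, ys):
    """Fit a non-decreasing step function y = f(x) by pool-adjacent-violators."""
    # Pool tied x values first so each x contributes one weighted point.
    grouped = {}
    for x, y in zip(xs, ys):
        total, count = grouped.get(x, (0.0, 0))
        grouped[x] = (total + y, count + 1)

    blocks = []  # each block: [sum of y, number of points, right edge in x]
    for x in sorted(grouped):
        total, count = grouped[x]
        blocks.append([total, count, x])
        # Merge backwards while the block means would decrease.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right

    edges = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]

    def predict(x):
        # First block whose right edge is >= x; clamp beyond the last edge.
        return values[min(bisect_left(edges, x), len(values) - 1)]

    return predict


def system_dependence(data):
    """data maps system name -> (metric_scores, human_ratings).

    Returns (ED per system, SysDep) under the assumed definitions:
    ED(k) = mean of f_k(m) - f_G(m), SysDep = max ED - min ED.
    """
    all_m = [m for ms, _ in data.values() for m in ms]
    all_h = [h for _, hs in data.values() for h in hs]
    f_global = fit_isotonic(all_m, all_h)

    ed = {}
    for name, (ms, hs) in data.items():
        f_k = fit_isotonic(ms, hs)
        ed[name] = mean(f_k(m) - f_global(m) for m in ms)
    return ed, max(ed.values()) - min(ed.values())


# Toy data (invented for illustration): both systems get identical metric
# scores, but humans rate system A ten points higher at every score.
toy = {
    "A": ([0.2, 0.4, 0.6, 0.8], [20, 40, 60, 80]),
    "B": ([0.2, 0.4, 0.6, 0.8], [10, 30, 50, 70]),
}
ed, sys_dep = system_dependence(toy)
print(ed, sys_dep)
```

In this toy case each per-system mapping is perfectly monotone, yet the global mapping sits between the two, so the metric under-rates system A and over-rates system B by the same amount. That is the situation the TL;DR describes: high correlation within every system does not prevent a systematic offset between systems.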

Abstract

Automated metrics for Machine Translation have made significant progress, with the goal of replacing expensive and time-consuming human evaluations. These metrics are typically assessed by their correlation with human judgments, which captures the monotonic relationship between human and metric scores. However, we argue that it is equally important to ensure that metrics treat all systems fairly and consistently. In this paper, we introduce a method to evaluate this aspect.

Paper Structure

This paper contains 13 sections, 4 equations, 1 figure, and 9 tables.

Figures (1)

  • Figure 1: Average human ratings associated with XCOMET scores on Chinese-to-English (zh-en) WMT23 data. We show scores for all systems in aggregate (global) and for two individual systems.