Don't Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation

Colten DiIanni, Daniel Deutsch

TL;DR

The paper identifies shortcomings of traditional meta-evaluation metrics in MT, notably susceptibility to outliers and limited use of distributional information, and proposes Pairwise Difference Pearson (PDP). PDP computes the Global Pearson correlation on intra-segment pairwise score differences ($X^*$, $Y^*$). It thereby leverages information from all segments while focusing on intra-segment dynamics and eliminating cross-segment raw-score differences. Empirical results on WMT'23 and WMT'24 show that PDP aligns better with human error weighting and down-weights sentinel-cand-mqm relative to $acc_{eq}$, with robust performance under noise and improved ranking of human annotations. The approach is shown to generalize to other segment-level meta-evaluation tasks in NLP, offering a distribution-aware alternative to existing metrics.

Abstract

This paper introduces Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric for Machine Translation (MT) that addresses limitations in previous Pearson's $ρ$-based and Kendall's $τ$-based meta-evaluation approaches. PDP is a correlation-based metric that utilizes pairwise differences rather than raw scores. It draws on information from all segments for a more robust understanding of score distributions and uses segment-wise pairwise differences to refine Global Pearson to intra-segment score comparisons. Analysis on the WMT'24 shared task shows that PDP properly ranks sentinel evaluation metrics and aligns better with human error weightings than previous work. Noise injection analysis demonstrates PDP's robustness to random noise, segment bias, and system bias while highlighting its sensitivity to extreme outliers.
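The pairwise-difference idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pairwise_differences`, `pearson`, `pdp`) are hypothetical, and the choice to enumerate every ordered pair of translations within a segment is an assumption made here so that the result does not depend on system order. Scores are assumed to be given as one list per segment, with one score per system in a fixed order.

```python
from itertools import permutations
from math import sqrt


def pairwise_differences(scores):
    """Collect score differences for every ordered pair of translations
    within each segment (assumption: ordered pairs, so each pair
    contributes both a - b and b - a)."""
    diffs = []
    for segment in scores:
        for a, b in permutations(segment, 2):
            diffs.append(a - b)
    return diffs


def pearson(x, y):
    """Plain Pearson correlation; raises ZeroDivisionError if either
    input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pdp(metric_scores, human_scores):
    """Pearson correlation computed over intra-segment pairwise
    differences pooled across all segments."""
    x_star = pairwise_differences(metric_scores)
    y_star = pairwise_differences(human_scores)
    return pearson(x_star, y_star)


# Toy data: two segments, three systems each (values are illustrative only).
human = [[0.0, -1.0, -5.0], [-2.0, -2.0, -10.0]]
# The metric reproduces the human scores up to a different offset per segment.
metric = [[10.0, 9.0, 5.0], [-5.0, -5.0, -13.0]]

print(pdp(metric, human))
```

Because only within-segment differences enter the correlation, the per-segment offsets in the toy metric cancel out, which is the sense in which cross-segment raw-score differences are eliminated.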

Paper Structure

This paper contains 18 sections, 8 equations, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: SDP for Segment-Wise Pearson's $\rho$, Global Pearson's $\rho$, $acc_{eq}$, and PDP under increasing levels of noise. Lower SDP values indicate greater stability and robustness to noise.