Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy

Brian Thompson, Nitika Mathur, Daniel Deutsch, Huda Khayrallah

TL;DR

This work introduces Soft Pairwise Accuracy (SPA), a permutation-based meta-metric that extends Pairwise Accuracy (PA) by incorporating the statistical significance of both the human judgments and the automatic metric scores. The core formulation, $\text{SPA} = \binom{N}{2}^{-1} \sum_{i<j} \left(1 - \left| p_{ij}^{h} - p_{ij}^{m} \right|\right)$, averages over all pairs among the $N$ systems, where $p_{ij}^{h}$ and $p_{ij}^{m}$ are the $p$-values for the comparison of systems $i$ and $j$ under the human judgments and the metric scores, respectively. Because these $p$-values are continuous and estimated via paired permutation tests, SPA retains uncertainty information that PA discards. Empirical analyses show that SPA is more stable than PA with respect to the number of systems and segments, avoids the ties caused by PA's small set of possible output values, and yields more statistically significant pairwise comparisons between metrics and hence finer significance clusters. The results support SPA’s practical value, culminating in its adoption as the official system-level meta-metric for the 2024 WMT Metrics Shared Task.
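
The SPA formula above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the one-sided direction of the test, the sign-flipping form of the paired permutation test, and the default of 1000 resamples are all choices made here for the example.

```python
import itertools
import numpy as np


def paired_permutation_pvalue(scores_i, scores_j, n_resamples=1000, rng=None):
    """One-sided p-value for 'system i beats system j' (hypothetical helper).

    Assumption: under the null hypothesis the two systems are exchangeable,
    so each segment's pair of scores is swapped with probability 0.5.
    """
    rng = np.random.default_rng(rng)
    diffs = np.asarray(scores_i, dtype=float) - np.asarray(scores_j, dtype=float)
    observed = diffs.mean()
    # Swapping a segment's pair of scores flips the sign of its difference.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null_means = (signs * diffs).mean(axis=1)
    # Fraction of resampled mean differences at least as large as observed.
    return float(np.mean(null_means >= observed))


def soft_pairwise_accuracy(human, metric, n_resamples=1000, seed=0):
    """SPA for score arrays of shape (n_systems, n_segments)."""
    human = np.asarray(human, dtype=float)
    metric = np.asarray(metric, dtype=float)
    rng = np.random.default_rng(seed)
    terms = []
    for i, j in itertools.combinations(range(human.shape[0]), 2):
        p_h = paired_permutation_pvalue(human[i], human[j], n_resamples, rng)
        p_m = paired_permutation_pvalue(metric[i], metric[j], n_resamples, rng)
        # One term of the sum: full credit when the two p-values match.
        terms.append(1.0 - abs(p_h - p_m))
    # Averaging over all pairs applies the 1 / (N choose 2) factor.
    return float(np.mean(terms))
```

Because the human and metric $p$-values are estimated from independent random resamples, a metric identical to the human scores yields an SPA close to, but not necessarily exactly, 1 in this sketch.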

Abstract

Selecting an automatic metric that best emulates human annotators is often non-trivial, because there is no clear definition of "best emulates." A meta-metric is required to compare the human judgments to the automatic metric scores, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric scores. We show that SPA is more stable than PA with respect to changes in the number of systems/segments used for evaluation. We also show that PA can only assign a small set of distinct output values to metrics, and this results in many metrics being artificially assigned the exact same PA score. We demonstrate that SPA fixes this issue. Finally, we show that SPA is more discriminative than PA, producing more statistically significant comparisons between metrics. SPA was selected as the official system-level metric for the 2024 WMT Metrics Shared Task.

Paper Structure (16 sections, 6 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Illustration of the individual components used to calculate both SPA and PA for the Prism metric (Thompson and Post, 2020a,b) on the WMT 2023 English-German language pair. Each box represents a comparison between two systems, $i$ and $j$. MT systems are sorted by average human judgment score for easier interpretation. The right column is one minus the absolute difference between the human preference for system $i$ over system $j$ (left column) and the metric preference for system $i$ over system $j$ (middle column). In PA (top row), human and metric preferences are binarized to 0 and 1, and PA is thus an average of binary terms. In SPA (bottom row), human and metric preferences range from 0 to 1, and as a result SPA is an average of values ranging from 0 to 1. SPA can be viewed as a "soft" extension to pairwise accuracy that incorporates both human judgment and metric uncertainty, allowing for partial credit.
  • Figure 2: Final metric ranking stability when ablating the number of MT systems (and thus the number of total MQM judgments), measured as change in Pearson correlation coefficient (Pearson $r$) from the ranking computed on all MT systems. Values are averaged over 1000 random trials. We find SPA to be more stable than PA in all cases.
  • Figure 3: The 95% confidence intervals for SPA (blue) and PA (red) on Metric-X (top) and XCOMET (bottom) when varying the number of annotations per system. We find that SPA has a tighter confidence interval than PA, and that its confidence interval shrinks to the width observed with the full set of annotations at smaller sample sizes than PA's does.
  • Figure 4: Metric Comparison Significance, WMT 2022 En$\rightarrow$De. Note that PA only assigns 11 distinct values to the 21 metrics (ties are shown in alternating purple and yellow text), whereas SPA produces a distinct value for each of the 21 metrics. SPA produces more statistically significant ($p \leq 0.05$, shown in green) comparisons between metrics (163 vs 108). As a result, SPA divides the metrics into 8 significance clusters (delineated with blue lines) compared to only 5 for PA. Results for other language pairs (not shown) are similar.
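
The tie behaviour noted for PA in the captions of Figures 1 and 4 follows from its definition as an average of binary terms: with $N$ systems, PA can only take values of the form $k / \binom{N}{2}$. The sketch below illustrates this on toy system-level averages. The numbers are made up for the example and are not taken from the paper, and the function name and the strict-inequality handling of ties are assumptions.

```python
import itertools
from math import comb


def pairwise_accuracy(human_means, metric_means):
    """Fraction of system pairs that the metric orders the same way as humans.

    Assumption: exact ties between two systems are treated as 'not greater'.
    """
    pairs = list(itertools.combinations(range(len(human_means)), 2))
    agree = [
        (human_means[i] > human_means[j]) == (metric_means[i] > metric_means[j])
        for i, j in pairs
    ]
    return sum(agree) / len(pairs)


n_systems = 4
n_pairs = comb(n_systems, 2)      # number of system pairs
possible_pa_values = n_pairs + 1  # PA is k / n_pairs for k = 0..n_pairs

human_means = [0.9, 0.7, 0.4, 0.1]     # toy values, not from the paper
metric_means = [0.8, 0.85, 0.3, 0.2]   # swaps the order of the top two systems
pa = pairwise_accuracy(human_means, metric_means)
```

Here the metric disagrees with the human ranking on one of the six pairs, so PA is 5/6. Any other metric that flips exactly one pair receives the same score regardless of how close that pair was under the human judgments, which is the kind of tie SPA's continuous terms avoid.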