
The statistical advantage of automatic NLG metrics at the system level

Johnny Tian-Zheng Wei, Robin Jia

TL;DR

This paper asks whether automatic NLG metrics can outperform human judgments for system-level evaluation by framing metric evaluation as a bias-variance-noise problem. It formalizes system-level scores, introduces a bootstrap-based bias-variance-noise decomposition for pairwise predictions, and empirically analyzes MT and summarization datasets (WMT and SummEval). The authors show that metrics can exhibit lower variance and, in some regimes (e.g., few human judgments or small system quality differences), achieve more accurate pairwise predictions than humans, despite their bias. They further compare metrics to a theoretical perfect annotator, perform power analyses, and discuss the practical limits of human evaluation, along with best practices for evaluation. The work indicates when metrics can be relied on, how to quantify their limits, and how to design more informative evaluation protocols for MT and NLG benchmarking.

Abstract

Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans in estimating system-level quality. Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated against noisy, human predicted labels instead of the ground truth, and metric predictions fluctuate based on the test sets they were calculated on. By applying a bias-variance-noise decomposition, we adjust this error to a noise-free, infinite test set setting. Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected. In MT, we identify two settings where metrics outperform humans due to a statistical advantage in variance: when the number of human judgments used is small, and when the quality difference between compared systems is small. The data and code to reproduce our analyses are available at https://github.com/johntzwei/metric-statistical-advantage .
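
The pairwise-prediction error described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' code and uses synthetic scores rather than WMT or SummEval data; the function name `pairwise_error`, the score distributions, and all numeric settings are assumptions chosen for illustration. It bootstraps the sign of the difference in mean scores between two systems and counts how often that sign disagrees with the sign obtained from the full pool of human scores.

```python
import numpy as np


def pairwise_error(scores_a, scores_b, n_judgments, true_sign,
                   n_trials=2000, seed=0):
    """Bootstrap estimate of pairwise prediction error.

    Resamples n_judgments segment scores per system (with replacement),
    predicts the better system from the sign of the difference in means,
    and returns the fraction of trials that disagree with true_sign.
    """
    rng = np.random.default_rng(seed)
    idx_a = rng.integers(0, len(scores_a), size=(n_trials, n_judgments))
    idx_b = rng.integers(0, len(scores_b), size=(n_trials, n_judgments))
    diffs = scores_a[idx_a].mean(axis=1) - scores_b[idx_b].mean(axis=1)
    return float(np.mean(np.sign(diffs) != true_sign))


# Synthetic data (assumption): system A is slightly better than system B.
rng = np.random.default_rng(1)
n_segments = 5000

# Human scores: centred on the true quality, but noisy per segment.
human_a = rng.normal(0.10, 1.0, n_segments)
human_b = rng.normal(0.00, 1.0, n_segments)

# Metric scores: a different (biased) mean difference, much less noisy.
metric_a = rng.normal(0.05, 0.2, n_segments)
metric_b = rng.normal(0.00, 0.2, n_segments)

# Treat the sign from all human scores as the reference label.
true_sign = np.sign(human_a.mean() - human_b.mean())

human_err_small = pairwise_error(human_a, human_b, 50, true_sign)
human_err_large = pairwise_error(human_a, human_b, 2000, true_sign)
metric_err_small = pairwise_error(metric_a, metric_b, 50, true_sign)

print("human error, 50 judgments:  ", human_err_small)
print("human error, 2000 judgments:", human_err_large)
print("metric error, 50 segments:  ", metric_err_small)
```

In this toy setting the human estimator's error falls as more judgments are collected, while the lower-variance metric already makes fewer errors at a small sample size. The paper's actual analysis additionally adjusts for label noise and finite test sets, which this sketch does not attempt.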

Paper Structure

This paper contains 30 sections, 19 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Distribution of estimators for the true difference in system quality $\delta^H$ between two generation systems (for illustrative purposes). Notation is defined in § \ref{section:formal_pairwise_judgments}. An estimate incurs prediction error if its sign is opposite to the true difference. While humans provide an unbiased estimator of the difference, a biased estimator derived from a metric can have a smaller error probability (shaded areas) due to its lower variance. Evidence supporting the illustration can be found in § \ref{section:comparing_to_humans}.
  • Figure 2: Comparison of metrics to human and perfect annotator estimators with varying number of judgments in WMT. Errors are adjusted to an idealized setting where true predictions are used for evaluation and metrics are computed on infinite test sets; here metric predictions become constant, so their errors are constant. Shaded in grey is the region where BERTscore is superhuman. Results for SummEval are in Appendix \ref{appendix:power_analysis}.
  • Figure 3: Average agreement of the main prediction to metric predictions computed from varying test set sizes in WMT. The main predictions were derived from all of our data. Each point was estimated with 10K bootstrap trials. As the size of the test set increases, the agreement increases monotonically. Note that only BLEURT and BERTscore are means of their segment-level scores.
  • Figure 4: Average agreement of the main prediction to metric predictions evaluated on varying test set sizes in SummEval. The main predictions were derived from all of our data. Each point was estimated with 10K bootstrap trials. As the size of the test set increases, the agreement increases monotonically. Note that all metrics are means of their segment-level scores.
  • Figure 5: Comparison of metrics to human and perfect annotator estimators with varying number of judgments in SummEval. Errors are adjusted to an idealized setting where true predictions are used for evaluation and metrics are computed on infinite test sets; here metric predictions become constant, so their errors are constant. No metric comes close to expert performance at any number of judgments (ROUGE, the best-performing summarization metric, has error $0.221$).
  • ...and 2 more figures
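
The intuition in the Figure 1 caption can be made concrete with a short worked expression. This is an illustrative reading under assumptions that do not come from the source: each estimator is taken to be approximately Gaussian, the true difference satisfies $\delta^H > 0$, and the symbols $b$ (metric bias), $\sigma_H$ and $\sigma_M$ (standard deviations of the human and metric estimators), and $\Phi$ (standard normal CDF) are introduced here only for this sketch. An estimate errs when its sign is negative, so

$$
P_{\text{err}}^{\text{human}} = \Phi\!\left(-\frac{\delta^H}{\sigma_H}\right),
\qquad
P_{\text{err}}^{\text{metric}} = \Phi\!\left(-\frac{\delta^H + b}{\sigma_M}\right).
$$

Under these assumptions the metric has the smaller error probability whenever

$$
\frac{\delta^H + b}{\sigma_M} > \frac{\delta^H}{\sigma_H}.
$$

If $\sigma_H$ shrinks roughly as $1/\sqrt{n}$ with the number of human judgments $n$, the inequality is most easily satisfied when $n$ is small or when $\delta^H$ is small, provided the bias does not flip the sign of $\delta^H + b$. This matches the two settings named in the abstract.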