The statistical advantage of automatic NLG metrics at the system level
Johnny Tian-Zheng Wei, Robin Jia
TL;DR
This paper examines whether automatic NLG metrics can outperform human judgments in system-level evaluation by framing metric evaluation as a bias-variance-noise problem. It formalizes system-level scores, introduces a bootstrap-based bias-variance-noise decomposition of pairwise prediction error, and empirically analyzes MT and summarization datasets (WMT and SummEval). The authors show that metrics can exhibit lower variance than human judgments and that, in some regimes (e.g., few human judgments or small quality differences between systems), metrics make more accurate pairwise predictions than humans despite their bias. They also compare metrics to a theoretical perfect annotator, perform power analyses, discuss the practical limits of human evaluation, and offer best practices for evaluation. The work gives guidance on when to rely on metrics, how to quantify their limits, and how to design more informative evaluation protocols for MT and NLG development and benchmarking.
Abstract
Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans in estimating system-level quality. Statistically, humans are unbiased, high-variance estimators, while metrics are biased, low-variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated against noisy, human-predicted labels instead of the ground truth, and metric predictions fluctuate based on the test sets they were calculated on. By applying a bias-variance-noise decomposition, we adjust this error to a noise-free, infinite test set setting. Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected. In MT, we identify two settings where metrics outperform humans due to a statistical advantage in variance: when the number of human judgments used is small, and when the quality difference between compared systems is small. The data and code to reproduce our analyses are available at https://github.com/johntzwei/metric-statistical-advantage.
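To make the bootstrap comparison of pairwise predictions concrete, the following is a minimal Python sketch on synthetic data. It is not the authors' implementation (their code is in the repository linked above). The function name `bootstrap_pairwise_rate`, the noise levels, and the quality gap are all hypothetical choices made for illustration. The sketch resamples paired segment-level scores with replacement and reports how often the resampled system-level means rank system A above system B.

```python
import numpy as np


def bootstrap_pairwise_rate(scores_a, scores_b, n_judgments, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which A's mean score exceeds B's.

    scores_a, scores_b: paired segment-level scores (same segments, same order).
    n_judgments: number of segments drawn with replacement per resample.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("paired scores must have the same shape")
    rng = np.random.default_rng(seed)
    # Each row of idx is one resample: segment indices drawn with replacement.
    idx = rng.integers(0, len(a), size=(n_boot, n_judgments))
    diff = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return float((diff > 0).mean())


# Toy data (all values are assumptions, not taken from the paper).
rng = np.random.default_rng(1)
n_segments = 2000
true_gap = 0.05  # hypothetical true quality gap; system A is better

# "Human" scores: centered on the true gap, but noisy per segment.
human_a = true_gap + rng.normal(0.0, 1.0, n_segments)
human_b = rng.normal(0.0, 1.0, n_segments)

# "Metric" scores: distorted gap (a stand-in for bias), less noise per segment.
metric_a = 0.5 * true_gap + rng.normal(0.0, 0.2, n_segments)
metric_b = rng.normal(0.0, 0.2, n_segments)

# How often each estimator ranks A above B with only 50 judgments.
human_rate = bootstrap_pairwise_rate(human_a, human_b, n_judgments=50)
metric_rate = bootstrap_pairwise_rate(metric_a, metric_b, n_judgments=50)
print(f"human: {human_rate:.3f}  metric: {metric_rate:.3f}")
```

Varying `n_judgments` and `true_gap` in this toy setup is one way to explore the two regimes the abstract names: few human judgments and small quality differences. The sketch omits the noise adjustment and the infinite test set correction that the paper's decomposition provides.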
