The price of debiasing automatic metrics in natural language evaluation
Arun Tejasvi Chaganty, Stephen Mussmann, Percy Liang
TL;DR
This paper addresses the bias and cost of evaluating natural language generation by combining cheap automatic metrics with expensive human judgments in a control variates estimator. It proves that the estimator μ̂_cv is minimax-optimal among unbiased estimators for fixed variances and correlation, and it analyzes data efficiency as a function of annotator variance and metric–human correlation. Empirically, current metrics and prompts yield only modest cost reductions (7–13%); larger gains require improving both the automatic metric and the evaluation prompt. The work identifies these two bottlenecks, the metric’s correlation with human judgments and the prompt shown to annotators, and suggests directions for more substantial efficiency gains, including post-editing interfaces and better metrics.
Abstract
For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7–13% cost reduction on evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights the two fundamental bottlenecks, the automatic metric and the prompt shown to human evaluators, both of which need to be improved to obtain greater cost savings.
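
To make the control variates idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the automatic metric can be computed cheaply on a large pool of outputs, so its mean and standard deviation are treated as known, while human judgments exist only for a small labeled subset. The function names `control_variates_mean` and `simulate`, the noise levels, and the synthetic data are all illustrative assumptions. Note that estimating the scaling factor from the same labeled sample can introduce a small finite-sample bias, which this sketch does not correct.

```python
import numpy as np


def control_variates_mean(human, metric_labeled, metric_pool):
    """Estimate the mean human judgment, using an automatic metric as a control variate.

    human          -- human judgments on the labeled subset (expensive)
    metric_labeled -- automatic metric scores on the same labeled subset
    metric_pool    -- automatic metric scores on a large pool (cheap)
    """
    human = np.asarray(human, dtype=float)
    metric_labeled = np.asarray(metric_labeled, dtype=float)
    metric_pool = np.asarray(metric_pool, dtype=float)

    pool_std = metric_pool.std()
    if pool_std == 0.0:
        # A constant metric carries no information; fall back to the plain mean.
        return float(human.mean())

    # Standardize the metric with pool statistics so it has (approximately)
    # zero mean and unit variance over the population of outputs.
    g = (metric_labeled - metric_pool.mean()) / pool_std

    # Sample covariance between human judgments and the standardized metric.
    alpha = np.mean((human - human.mean()) * g)

    # Subtract the part of the sampling noise that the metric explains.
    return float(human.mean() - alpha * g.mean())


def simulate(trials=500, n=50, pool_size=20000, seed=0):
    """Compare the plain human mean with the control variates estimate on synthetic data."""
    rng = np.random.default_rng(seed)
    quality = rng.normal(0.5, 1.0, size=pool_size)           # hypothetical true quality
    metric = quality + rng.normal(0.0, 0.3, size=pool_size)  # well-correlated metric
    plain, cv = [], []
    for _ in range(trials):
        idx = rng.choice(pool_size, size=n, replace=False)
        human = quality[idx] + rng.normal(0.0, 0.5, size=n)  # noisy annotators
        plain.append(human.mean())
        cv.append(control_variates_mean(human, metric[idx], metric))
    return float(quality.mean()), np.array(plain), np.array(cv)


if __name__ == "__main__":
    true_mean, plain, cv = simulate()
    print("true mean:", true_mean)
    print("variance of plain human mean:", plain.var())
    print("variance of control variates estimate:", cv.var())
```

In this synthetic setup the metric correlates strongly with quality, so the control variates estimate varies much less across trials than the plain mean. Lowering the correlation or raising the annotator noise in `simulate` shrinks that gap, which mirrors the two bottlenecks the abstract names.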
