The price of debiasing automatic metrics in natural language evaluation

Arun Tejasvi Chaganty, Stephen Mussman, Percy Liang

TL;DR

This paper addresses the bias and cost of evaluating natural language generation by combining cheap automatic metrics with expensive human judgments via a control variates estimator. It proves that the estimator μ̂_cv is minimax-optimal for unbiased evaluation under fixed variances and correlation, and analyzes data efficiency as a function of annotator variance and metric–human correlation. Empirically, using current metrics and prompts yields modest cost reductions (7–13%), with larger gains contingent on improving both the automatic metric and the evaluation prompt. The work highlights the two bottlenecks—the metric’s correlation with human judgments and the annotator prompt—and suggests directions for achieving more substantial efficiency improvements, including post-editing interfaces and better metrics.

Abstract

For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7–13% cost reduction on evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights the two fundamental bottlenecks (the automatic metric and the prompt shown to human evaluators), both of which need to be improved to obtain greater cost savings.
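To make the control variates idea concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes synthetic Gaussian data, a hypothetical helper `control_variates_estimate`, and that the automatic metric can be scored on a large unlabeled pool so that its mean and standard deviation are effectively known. The scaling factor `alpha` is estimated from the labeled sample itself, which is a simplification.

```python
import numpy as np


def control_variates_estimate(y, g_labeled, g_pool):
    """Sketch of a control variates estimate of the mean human judgment.

    y         : noisy human judgments on n labeled examples
    g_labeled : automatic metric scores on the same n examples
    g_pool    : metric scores on a large unlabeled pool (assumed cheap to get)
    """
    y = np.asarray(y, dtype=float)
    g_labeled = np.asarray(g_labeled, dtype=float)
    g_pool = np.asarray(g_pool, dtype=float)
    # Normalize the metric with pool statistics so it has (approximately)
    # zero mean and unit variance over the whole data distribution.
    g_norm = (g_labeled - g_pool.mean()) / g_pool.std()
    # Assumption: alpha is estimated as the sample covariance of y and g_norm.
    alpha = np.cov(y, g_norm)[0, 1]
    # Subtracting a zero-mean, correlated quantity keeps the mean but cuts variance.
    return float(np.mean(y - alpha * g_norm))


def simulate(n=50, trials=2000, rho=0.8, sigma_a=0.5, seed=0):
    """Compare plain averaging with the control variates sketch on synthetic data."""
    rng = np.random.default_rng(seed)
    pool_size = 20000
    f_pool = rng.normal(size=pool_size)  # hypothetical "true" judgments
    # Synthetic metric whose correlation with f_pool is roughly rho.
    g_pool = rho * f_pool + np.sqrt(1 - rho**2) * rng.normal(size=pool_size)
    truth = f_pool.mean()
    plain, cv = [], []
    for _ in range(trials):
        idx = rng.choice(pool_size, size=n, replace=False)
        y = f_pool[idx] + sigma_a * rng.normal(size=n)  # add annotator noise
        plain.append(y.mean())
        cv.append(control_variates_estimate(y, g_pool[idx], g_pool))
    return truth, np.array(plain), np.array(cv)


if __name__ == "__main__":
    truth, plain, cv = simulate()
    print("variance of plain mean:      ", plain.var())
    print("variance of control variates:", cv.var())
```

In this toy setting both estimators center on the same value, while the control variates version fluctuates less from trial to trial. How large that gap is depends on the assumed correlation `rho` and annotator noise `sigma_a`, which are illustrative values rather than measurements from the paper.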

Paper Structure

This paper contains 25 sections, 6 theorems, 27 equations, 8 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

Among all unbiased estimators that are functions of $y^{(i)}$ and $g(z^{(i)})$, and for all distributions with a given $\sigma^2_f$, $\sigma^2_a$, and $\alpha$, no other estimator has a lower worst-case variance than the control variates estimator $\hat{\mu}_{cv}$.
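As a worked sketch of why a control variate lowers variance, the derivation below uses the standard control variates construction under stated assumptions; it is not quoted from the paper. The assumptions are: each human judgment is $y^{(i)} = f(z^{(i)}) + \epsilon^{(i)}$ with zero-mean annotator noise of variance $\sigma^2_a$; the metric $g$ is normalized to zero mean and unit variance; $\alpha = \operatorname{Cov}(f(z), g(z))$; and $\rho = \alpha / \sigma_f$ is the metric–human correlation.

```latex
\begin{align*}
\hat{\mu}_{\mathrm{mean}} &= \frac{1}{n}\sum_{i=1}^{n} y^{(i)},
\qquad
\operatorname{Var}(\hat{\mu}_{\mathrm{mean}}) = \frac{1}{n}\left(\sigma_f^2 + \sigma_a^2\right), \\
\hat{\mu}_{cv} &= \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \alpha\, g(z^{(i)})\right), \\
\operatorname{Var}(\hat{\mu}_{cv})
  &= \frac{1}{n}\left(\sigma_f^2 + \sigma_a^2 - 2\alpha\operatorname{Cov}(f(z), g(z)) + \alpha^2 \operatorname{Var}(g(z))\right) \\
  &= \frac{1}{n}\left(\sigma_f^2 + \sigma_a^2 - \alpha^2\right)
   = \frac{1}{n}\left(\sigma_f^2\,(1 - \rho^2) + \sigma_a^2\right).
\end{align*}
```

Because $g$ has zero mean, subtracting $\alpha\, g(z^{(i)})$ leaves the expectation unchanged. Under these assumptions only the $\sigma_f^2$ term shrinks, by a factor of $1 - \rho^2$, while the annotator variance $\sigma_a^2$ is untouched. This is consistent with the two bottlenecks named in the abstract.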

Figures (8)

  • Figure 1: (a) At the system level, automatic metrics (ROUGE-L) and human judgment correlate well, but (b) the instance-level plot (where each point is a system prediction) shows that the instance-level correlation is quite low ($\rho = 0.31$). As a consequence, if we try to locally improve systems to produce better answers ($\triangleright$ in (a)), they do not significantly improve ROUGE scores, and vice versa ($\vartriangle$).
  • Figure 2: The samples from $f(z)$ have a higher variance than the samples from $f(z)-g(z)$ but the same mean. This is the key idea behind using control variates to reduce variance.
  • Figure 3: Inverse data efficiency for various values of $\gamma$ and $\rho$. We need both low $\gamma$ and high $\rho$ to obtain significant gains.
  • Figure 4: Screenshots of the annotation interfaces we used to measure (a) summary language quality on CNN/Daily Mail and (b) answer correctness on MS MARCO tasks.
  • Figure 5: Correlations of different automatic metrics on the MS MARCO and CNN/Daily Mail tasks. Certain systems are more correlated with certain automatic metrics than others, but overall the correlation is low to moderate for most systems and metrics.
  • ...and 3 more figures
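The trade-off described in the Figure 3 caption can be sketched numerically. This is an illustrative sketch under assumptions, since this summary does not define $\gamma$: it takes $\gamma = \sigma^2_a / \sigma^2_f$ (annotator variance relative to the variance of the true judgment), and it assumes that plain averaging has variance proportional to $\sigma^2_f + \sigma^2_a$ while the control variates estimator has variance proportional to $\sigma^2_f (1 - \rho^2) + \sigma^2_a$. The function name `inverse_data_efficiency` is hypothetical.

```python
def inverse_data_efficiency(gamma, rho):
    """Ratio of control variates variance to plain-mean variance.

    Assumes gamma = sigma_a^2 / sigma_f^2 and the variance forms stated in
    the lead-in; values near 1 mean almost no savings, values near 0 mean
    far fewer human labels are needed for the same precision.
    """
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    return (1.0 - rho**2 + gamma) / (1.0 + gamma)


if __name__ == "__main__":
    # Small grid of illustrative values (not taken from the paper).
    for gamma in (0.0, 0.5, 1.0, 2.0):
        row = [inverse_data_efficiency(gamma, rho) for rho in (0.3, 0.6, 0.9)]
        print(f"gamma={gamma:.1f}:", ", ".join(f"{v:.3f}" for v in row))
```

Under these assumptions, a high $\rho$ helps little when $\gamma$ is large, and a small $\gamma$ helps little when $\rho$ is low. That matches the caption's point that both are needed for significant gains.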

Theorems & Definitions (10)

  • Theorem 3.1
  • Proposition 3.1
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • Theorem 1 (restated; source label `thm:main`)
  • proof
  • Proposition 1 (restated; source label `prop:added_bias`)
  • proof