Schroedinger's Threshold: When the AUC doesn't predict Accuracy

Juri Opitz

TL;DR

The paper demonstrates that $AUC$ (AUROC) can misrepresent downstream accuracy when evaluating binary faithfulness predictors across diverse NLP domains. By simulating deployment with cross-domain calibration using Platt scaling, isotonic regression, and a decision stump, it measures an 'expected accuracy' that reflects real-world decision thresholds. The author shows substantial ranking differences between $AUC$ and downstream accuracy (e.g., Q2 shifts from 3rd to 1st) and reveals strong effects of both calibration data and calibration method, with domain generalization proving difficult. The work argues for calibration-aware evaluation when benchmarking faithfulness metrics, with practical implications for method selection and deployment.

Abstract

The Area Under Curve measure (AUC) seems apt to evaluate and compare diverse models, possibly without calibration. An important example of AUC application is the evaluation and benchmarking of models that predict faithfulness of generated text. But we show that the AUC yields an academic and optimistic notion of accuracy that can misalign with the actual accuracy observed in application, yielding significant changes in benchmark rankings. To paint a more realistic picture of downstream model performance (and prepare a model for actual application), we explore different calibration modes, testing calibration data and method.
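The gap between AUC and accuracy under a transferred threshold can be illustrated with a minimal sketch. It is not the paper's experimental code: the scores are made up, the helper names (`auc`, `accuracy`, `fit_stump`) are hypothetical, and only a decision stump is shown (Platt scaling and isotonic regression are omitted). The stump threshold is fitted on one toy "calibration domain" and applied unchanged to a toy "target domain".

```python
# Toy illustration only: all scores are synthetic, not taken from the paper.

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold):
    """Accuracy when predicting the positive class for scores above the threshold."""
    hits = sum(int(s > threshold) == y for s, y in zip(scores, labels))
    return hits / len(labels)

def fit_stump(scores, labels):
    """Decision stump: pick the threshold that maximizes accuracy on calibration data."""
    uniq = sorted(set(scores))
    midpoints = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    candidates = [uniq[0] - 1.0] + midpoints + [uniq[-1] + 1.0]
    return max(candidates, key=lambda t: accuracy(scores, labels, t))

labels = [0, 0, 0, 1, 1, 1]  # same label layout in both toy domains

# Hypothetical model A: ranks the target domain perfectly, but its score scale shifts.
a_calib = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
a_target = [0.6, 0.65, 0.7, 0.8, 0.85, 0.9]

# Hypothetical model B: slightly worse ranking, but a stable score scale.
b_calib = [0.2, 0.3, 0.6, 0.4, 0.7, 0.8]
b_target = [0.2, 0.3, 0.6, 0.4, 0.7, 0.8]

auc_a, auc_b = auc(a_target, labels), auc(b_target, labels)
acc_a = accuracy(a_target, labels, fit_stump(a_calib, labels))
acc_b = accuracy(b_target, labels, fit_stump(b_calib, labels))

print(f"A: AUC={auc_a:.3f}, accuracy with transferred threshold={acc_a:.3f}")
print(f"B: AUC={auc_b:.3f}, accuracy with transferred threshold={acc_b:.3f}")
```

In this toy setup model A wins on AUC while model B wins on accuracy, because AUC only asks whether some good threshold exists on the target data, whereas deployment has to commit to a threshold chosen elsewhere.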

Paper Structure (22 sections, 4 equations, 4 figures, 5 tables)

Introduction
Preliminaries
AUC (or AUROC)
AUC seems appealing (theoretically):
Experimental setup
Data sets
Measurement of expected accuracy
AUC mispredicts accuracy
Experiment goal
Experiment results
Studying score distribution
Why would Q2 be preferable over ANLI?
Less variance $\rightarrow$ easier calibration?
Analysis
Effect of calibration technique
...and 7 more sections

Figures (4)

Figure 1: In NLP we witness diverse domains and tasks (here: dialog, faithfulness), and wonder about the predictive power of scores by diverse models (here: e.g., the BERT/BARTscore metric, task-focused systems such as the automatic Q/A metric 'Q$^2$' or Natural Language Inference systems, possibly also LLMs). While the AUC seems appealing as an assessment measure, it bears pitfalls.
Figure 2: ROC curve examples of different models.
Figure 3: Histograms of the best-performing models, Q$^2$ and ANLI. Q$^2$ performs best according to expected accuracy; ANLI performs best according to AUC.
Figure 4: Top: histograms of models that perform better under expected accuracy (vs. AUC). Bottom: histograms of models that perform worse.
