Schroedinger's Threshold: When the AUC doesn't predict Accuracy
Juri Opitz
TL;DR
The paper demonstrates that $AUC$ (AUROC) can misrepresent downstream accuracy when evaluating binary faithfulness predictors across diverse NLP domains. By simulating deployment with cross-domain calibration using Platt scaling, isotonic regression, and a decision stump, it measures an 'expected accuracy' that reflects the decision thresholds a deployed system would actually use. It shows substantial ranking differences between $AUC$ and downstream accuracy (e.g., Q2 shifts from 3rd to 1st) and finds that both the calibration data and the calibration method strongly affect the results, with domain generalization proving difficult. The work argues for calibration-aware evaluation when benchmarking faithfulness metrics, with practical implications for method selection and deployment in varied applications.
Abstract
The Area Under Curve measure (AUC) seems apt to evaluate and compare diverse models, possibly without calibration. An important application of the AUC is the evaluation and benchmarking of models that predict the faithfulness of generated text. But we show that the AUC yields an academic and optimistic notion of accuracy that can misalign with the actual accuracy observed in application, leading to significant changes in benchmark rankings. To paint a more realistic picture of downstream model performance (and prepare a model for actual application), we explore different calibration modes, varying both the calibration data and the calibration method.
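The gap between AUC and thresholded accuracy can be illustrated with a minimal sketch. It is not the paper's implementation: the two predictors (`metric_a`, `metric_b`), their scores, and the helper names are hypothetical toy values chosen for illustration. Of the three calibration methods named above, only a decision stump (a single learned threshold) is shown. The threshold is fitted on one domain and applied to another, mimicking cross-domain deployment.

```python
# Toy illustration (all data invented): a predictor with perfect AUC on the
# test domain can still lose on accuracy once a threshold fitted on a
# different domain is applied.

def auc(scores, labels):
    """AUROC as the probability that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold):
    """Accuracy when predicting 'faithful' (1) for scores above the threshold."""
    preds = [int(s > threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def fit_stump(scores, labels):
    """Decision stump: pick the threshold that maximizes calibration accuracy."""
    uniq = sorted(set(scores))
    midpoints = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    candidates = [uniq[0] - 1.0] + midpoints + [uniq[-1] + 1.0]
    return max(candidates, key=lambda t: accuracy(scores, labels, t))

labels = [0, 0, 0, 1, 1, 1]  # same label layout in both toy domains

# Hypothetical predictor A: ranks perfectly, but its score scale shifts
# between the calibration domain and the test domain.
metric_a = {
    "calib": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "test":  [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}
# Hypothetical predictor B: imperfect ranking, but a stable score scale.
metric_b = {
    "calib": [0.2, 0.3, 0.6, 0.4, 0.7, 0.8],
    "test":  [0.1, 0.3, 0.5, 0.4, 0.6, 0.9],
}

results = {}
for name, m in [("A", metric_a), ("B", metric_b)]:
    threshold = fit_stump(m["calib"], labels)   # calibrate on one domain
    results[name] = {
        "auc": auc(m["test"], labels),           # threshold-free view
        "acc": accuracy(m["test"], labels, threshold),  # deployed view
    }
    print(name, results[name])
```

In this toy setup, predictor A has the higher test-domain AUC but the lower accuracy, because the threshold learned on the calibration domain sits below all of its test-domain scores. Predictor B ranks slightly worse but transfers its threshold well, so the two evaluation views order the predictors differently.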
