A Closer Look at AUROC and AUPRC under Class Imbalance

Matthew B. A. McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, Jack Gallifant

TL;DR

This work challenges the widespread belief that AUPRC is universally preferable to AUROC under class imbalance. Through theoretical results and empirical validation on synthetic and real-world fairness datasets, the authors show that AUPRC weights high-score errors more heavily, can amplify disparities across subpopulations, and is not universally advantageous for model evaluation or deployment. They also document extensive misattribution in the literature that presents AUPRC as the superior metric for imbalanced settings. The paper provides context-aware guidance for metric selection and warns against unchecked generalizations, highlighting ethical considerations in fairness-sensitive applications. Overall, it advances the technical understanding of AUROC versus AUPRC and advocates careful, deployment-aware metric reporting in ML practice.

Abstract

In machine learning (ML), a widespread claim is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for tasks with class imbalance. This paper refutes this notion on two fronts. First, we theoretically characterize the behavior of AUROC and AUPRC in the presence of model mistakes, establishing clearly that AUPRC is not generally superior in cases of class imbalance. We further show that AUPRC can be a harmful metric as it can unduly favor model improvements in subpopulations with more frequent positive labels, heightening algorithmic disparities. Next, we empirically support our theory using experiments on both semi-synthetic and real-world fairness datasets. Prompted by these insights, we conduct a review of over 1.5 million scientific papers to understand the origin of this invalid claim, finding that it is often made without citation, misattributed to papers that do not argue this point, and aggressively over-generalized from source arguments. Our findings represent a dual contribution: a significant technical advancement in understanding the relationship between AUROC and AUPRC and a stark warning about unchecked assumptions in the ML community.
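
The abstract's point about how the two metrics respond to model mistakes can be illustrated with a small numerical sketch. The code below is not the authors' implementation: it assumes untied scores, estimates AUROC as the fraction of correctly ordered positive-negative pairs, estimates AUPRC as average precision, and treats a "mistake" as a negative sample scored immediately above a positive one. The function names `auroc`, `auprc`, and `fix_mistake`, and the eight-sample dataset, are hypothetical choices made for this illustration.

```python
import numpy as np

def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly (assumes no ties)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return float(np.mean(pos[:, None] > neg[None, :]))

def auprc(labels, scores):
    """Average precision: mean of precision@k over the ranks k holding a positive."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)              # rank 1 = highest score
    ranked = labels[order]
    precision_at_k = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
    return float(precision_at_k[ranked == 1].mean())

def fix_mistake(scores, i, j):
    """Return a copy of scores with the scores of samples i and j swapped."""
    fixed = np.array(scores, dtype=float)
    fixed[[i, j]] = fixed[[j, i]]
    return fixed

# Toy data (assumed for illustration): 3 positives, 5 negatives, sorted by score.
labels = np.array([0, 1, 1, 0, 0, 0, 0, 1])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

# Two adjacent misordered pairs: one near the top of the ranking, one near the bottom.
high_fix = fix_mistake(scores, 0, 1)   # negative at 0.9 above positive at 0.8
low_fix = fix_mistake(scores, 6, 7)    # negative at 0.3 above positive at 0.2

auroc_gain_high = auroc(labels, high_fix) - auroc(labels, scores)
auroc_gain_low = auroc(labels, low_fix) - auroc(labels, scores)
auprc_gain_high = auprc(labels, high_fix) - auprc(labels, scores)
auprc_gain_low = auprc(labels, low_fix) - auprc(labels, scores)

print(f"AUROC gain: high-score fix {auroc_gain_high:.4f}, low-score fix {auroc_gain_low:.4f}")
print(f"AUPRC gain: high-score fix {auprc_gain_high:.4f}, low-score fix {auprc_gain_low:.4f}")
```

On this toy dataset both fixes raise AUROC by the same amount, while the high-score fix raises average precision by much more than the low-score fix. That matches the qualitative behavior the abstract describes, though a single hand-built example is not evidence for the general theorem.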

Paper Structure

This paper contains 42 sections, 8 theorems, 17 equations, 8 figures, and 3 tables.

Key Result

Theorem 1

Let $\mathcal{X}, \mathcal{Y} = \{0, 1\}$ represent a paired feature and binary classification label space from which i.i.d. samples $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are drawn via the joint distribution over the random variables $\mathsf{x}, \mathsf{y}$. Let $f: \mathcal{X} \to (0, 1)$ be

Figures (8)

  • Figure 1: a) Consider a model $f$ yielding continuous output scores for a binary classification task applied to a dataset consisting of two distinct subpopulations, $\mathcal{A} \in \{0, 1\}$. If we order samples in ascending order of output score, each misordered pair of samples (e.g., mistakes 1-4) represents an opportunity for model improvement. Theorem \ref{thm:mistake_order_differences} shows that a model's AUROC will improve by the same amount no matter which mistake you fix, while the model's AUPRC will improve by an amount correlated with the score of the sample. b) When comparing models absent a specific deployment scenario, we have no reason to value improving one mistake over another, and model evaluation metrics should therefore improve equally regardless of which mistake is corrected. c) When false negatives have a high cost relative to false positives, evaluation metrics should favor mistakes that have lower scores, regardless of any class imbalance. d) When limited resources will be distributed among a population according to model score, in a manner that requires certain subpopulations to all be offered commensurate possible benefit from the intervention for ethical reasons, evaluation metrics should prioritize the importance of within-group, high-score mistakes such that the highest risk members of all subgroups receive interventions. e) When false positives are expensive relative to false negatives and there are no fairness concerns, evaluation metrics should favor model improvements in decreasing order with score.
  • Figure 2: Synthetic experiment per-group AUROC, showing a confidence interval spanning the 5th to 95th percentile of results observed across all seeds, after successively either fixing individual mistakes, as defined in Definition \ref{def:mistake} (panels a and b), or choosing the optimal score permutation (panels c and d), in order to optimize either AUROC (panels a and c) or AUPRC (panels b and d). Under both forms of optimization, AUPRC clearly favors the higher-prevalence subpopulation, whereas AUROC treats subgroups approximately equally. Similar patterns were observed when comparing per-group AUPRCs over the same experimental procedures, as shown in Appendix Figure \ref{fig:full_optimization_experiment}.
  • Figure 3: Difference in the Spearman's $\rho$ between the test-set signed AUROC gap versus the validation set overall AUPRC, and the AUROC gap versus the overall AUROC. Numbers in parentheses are the prevalence ratios between the two groups for the particular attribute, and datasets are sorted by this quantity. Error bars are 95% confidence intervals from 20 different random data splits.
  • Figure 4: Comparison of the impact of optimizing for overall AUROC and overall AUPRC on the per-group AUROC and AUPRCs of two groups in a synthetic setting, using both the sequentially fixing individual mistakes optimization procedure (M2; top) and the sequentially permuting nearby scores optimization procedure (M3; bottom) described in Section \ref{subsec:synthetic_exp}. Note that the prevalences of $Y$ in the high-prevalence group and the low-prevalence group are 0.05 and 0.01, respectively.
  • Figure 5: Comparison of the impact of optimizing for overall AUROC and overall AUPRC on the per-group AUROC and AUPRCs of two groups in a synthetic setting where the initial AUROC was set to 0.65 rather than 0.85, using both the sequentially fixing individual mistakes optimization procedure (M2; top) and the sequentially permuting nearby scores optimization procedure (M3; bottom) described in Section \ref{subsec:synthetic_exp}. Note that the prevalences of $Y$ in the high-prevalence group and the low-prevalence group are 0.05 and 0.01, respectively.
  • ...and 3 more figures
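
The contrast described in the Figure 1 caption can be made concrete with a short calculation on finite-sample estimators. This is an illustrative sketch under stated assumptions, not the paper's theorem statement: scores are untied, there are $n_+$ positives and $n_-$ negatives, AUROC is estimated as the fraction of correctly ordered positive-negative pairs, and AUPRC is estimated as average precision (AP). Ranks here are counted from the highest score down, the reverse of the caption's ascending order. Suppose a negative sits at rank $k$, a positive sits immediately below it at rank $k+1$, and $t$ positives are ranked above both. Swapping the two scores corrects exactly one misordered pair and moves the positive from precision $(t+1)/(k+1)$ to precision $(t+1)/k$, leaving every other positive's precision unchanged:

$$
\Delta\mathrm{AUROC} = \frac{1}{n_+ n_-},
\qquad
\Delta\mathrm{AP} = \frac{1}{n_+}\left(\frac{t+1}{k} - \frac{t+1}{k+1}\right) = \frac{t+1}{n_+\,k\,(k+1)}.
$$

The AUROC gain does not depend on where the mistake sits, whereas the AP gain shrinks roughly as $1/k^2$ as the mistake moves down the ranking. For example, with $n_+ = 3$, a mistake at $k = 1$, $t = 0$ yields $\Delta\mathrm{AP} = 1/6$, while a mistake at $k = 7$, $t = 2$ yields $\Delta\mathrm{AP} = 3/168 \approx 0.018$.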

Theorems & Definitions (15)

  • Claim 1
  • Theorem 1
  • Definition 2.1
  • Theorem 2
  • Theorem 3
  • Theorem 3
  • proof
  • Theorem 3
  • proof
  • Lemma 1
  • ...and 5 more