Trust, or Don't Predict: Introducing the CWSA Family for Confidence-Aware Model Evaluation
Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar, Pegah Ghaffari
TL;DR
The paper addresses trustworthy model evaluation under abstention by proposing two threshold-local metrics: Confidence-Weighted Selective Accuracy (CWSA) and its normalized variant CWSA+. Both metrics reward correct predictions in proportion to their confidence through a linear weight $\phi(c) = \frac{c-\tau}{1-\tau}$ for $c \ge \tau$, and CWSA additionally penalizes overconfident errors. Over the retained set $\mathcal{S}_\tau = \{i : c_i \ge \tau\}$, they are defined as $\text{CWSA}(\tau) = \frac{1}{|\mathcal{S}_\tau|} \sum_{i\in\mathcal{S}_\tau} \phi(c_i) \cdot (2\mathbb{I}\{\hat{y}_i=y_i\}-1)$ and $\text{CWSA}^+(\tau) = \frac{1}{|\mathcal{S}_\tau|} \sum_{i\in\mathcal{S}_\tau} \phi(c_i) \cdot \mathbb{I}\{\hat{y}_i=y_i\}$. In experiments on MNIST, CIFAR-10, and synthetic models, the authors show that CWSA and CWSA+ identify nuanced failure modes and outperform classical metrics (accuracy, ECE, AURC) in trust-sensitive scenarios. Because the metrics are threshold-local and decomposable, they support threshold tuning and deployment, with practical implications for safety-critical systems. Limitations include reliance on calibrated confidence and non-differentiability caused by the hard threshold, which suggests future work on smoothing and on extensions to regression or structured outputs.
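The two formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `cwsa_scores`, the reading of $\mathcal{S}_\tau$ as the samples with $c_i \ge \tau$, and the choice to return NaN when no sample is retained are all assumptions made for illustration.

```python
import numpy as np


def cwsa_scores(confidence, correct, tau):
    """Return (CWSA, CWSA+, coverage) at threshold tau.

    Hypothetical helper: assumes S_tau = {i : c_i >= tau} and returns NaN
    scores when S_tau is empty (a convention chosen here, not in the source).
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1) so that 1 - tau is positive")

    selected = confidence >= tau              # membership in S_tau
    coverage = float(selected.mean())         # fraction of samples retained
    if not selected.any():
        return float("nan"), float("nan"), coverage

    # Linear weight phi(c) = (c - tau) / (1 - tau) on the retained samples.
    phi = (confidence[selected] - tau) / (1.0 - tau)
    # 2 * I{correct} - 1: +1 for a correct prediction, -1 for an error.
    sign = np.where(correct[selected], 1.0, -1.0)

    cwsa = float(np.mean(phi * sign))                    # errors are penalized
    cwsa_plus = float(np.mean(phi * correct[selected]))  # errors contribute 0
    return cwsa, cwsa_plus, coverage


# Toy data: three confident predictions (one of them wrong) and one that
# falls below the threshold and is therefore abstained on.
conf = [0.95, 0.90, 0.60, 0.99]
hit = [True, True, True, False]
cwsa, cwsa_plus, coverage = cwsa_scores(conf, hit, tau=0.8)
# cwsa ≈ 0.10, cwsa_plus ≈ 0.417, coverage = 0.75
```

In this toy case the single error at confidence 0.99 carries weight 0.95 and nearly cancels the two correct predictions (weights 0.75 and 0.5), so CWSA drops to about 0.10 even though selective accuracy on the retained set is 2/3. CWSA+ ignores the sign of the error and stays at about 0.417.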
Abstract
Modern machine learning systems increasingly use confidence scores to drive selective prediction, in which a model abstains when its confidence is low. Yet conventional metrics such as accuracy, expected calibration error (ECE), and area under the risk-coverage curve (AURC) do not capture the actual reliability of predictions. These metrics either disregard confidence entirely, dilute valuable localized information through averaging, or fail to adequately penalize overconfident misclassifications, which can be particularly harmful in real-world systems. We introduce two new metrics, Confidence-Weighted Selective Accuracy (CWSA) and its normalized variant CWSA+, that offer a principled and interpretable way to evaluate predictive models under confidence thresholds. Unlike existing methods, our metrics explicitly reward confident accuracy and penalize overconfident mistakes. They are threshold-local, decomposable, and usable in both evaluation and deployment settings where trust and risk must be quantified. Through extensive experiments on real-world datasets (MNIST, CIFAR-10) and artificial model variants (calibrated, overconfident, underconfident, random, perfect), we show that CWSA and CWSA+ effectively detect nuanced failure modes and outperform classical metrics in trust-sensitive tests. Our results confirm that CWSA is a sound basis for developing and assessing selective prediction systems in safety-critical domains.
