Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring

Han-Jay Shu; Wei-Ning Chiu; Shun-Ting Chang; Meng-Ping Huang; Takeshi Tohyama; Ahram Han; Po-Chih Kuo

TL;DR

This work addresses hidden failures in CXR AI by proposing augmentation-sensitivity risk scoring (ASRS), a label-free framework that detects error-prone cases via representation instability under small rotations. Using RAD-DINO embeddings, ASRS computes $s(x) = \sum_{t \in \mathcal{T}} \lVert z_t - z_0 \rVert_2$ with rotations $\mathcal{T} = \{\pm 15^{\circ}, \pm 30^{\circ}\}$, and partitions test samples into quartiles G1–G4 based on validation thresholds. Across four diagnostic tasks (Cardiomegaly, Edema, Pneumothorax, Pleural Effusion) and three encoders on MIMIC-CXR-JPG, highly rotation-sensitive cases show a substantial recall deficit ($\approx 0.25$–$0.30$) despite high AUROC and confidence, exposing a hidden failure mode not detected by traditional metrics. ASRS enables selective prediction by auto-accepting stable cases and flagging unstable ones for clinician review, offering a practical path to safer, fairer deployment of medical AI in CXRs.
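The score and quartile assignment described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the embeddings $z_0$ (unrotated) and $z_t$ (rotated) have already been extracted by an encoder and are passed in as plain sequences of floats. The function names `asrs_score`, `fit_thresholds`, and `assign_group` are hypothetical. The ordering of groups (G1 least sensitive, G4 most sensitive) is also an assumption, since the text only says that test samples are split into quartiles using validation thresholds.

```python
import math
from bisect import bisect_left
from statistics import quantiles

# Rotation set T from the text, in degrees.
ANGLES = (-30, -15, 15, 30)


def asrs_score(z0, rotated):
    """Compute s(x) = sum over t in T of ||z_t - z_0||_2.

    z0      -- embedding of the unrotated image
    rotated -- dict mapping each angle in ANGLES to the embedding z_t
    """
    return sum(math.dist(rotated[angle], z0) for angle in ANGLES)


def fit_thresholds(val_scores):
    """Estimate the three quartile cut points on validation scores."""
    return quantiles(val_scores, n=4)


def assign_group(score, thresholds):
    """Map a test score to a quartile label.

    Assumption: G1 holds the lowest (most stable) scores and G4 the
    highest (most rotation-sensitive) scores.
    """
    return "G" + str(bisect_left(thresholds, score) + 1)
```

Because the thresholds are fitted once on validation scores and then reused, assigning a test image to a group needs only its own embeddings and no labels.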

Abstract

Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches, based on confidence calibration or out-of-distribution (OOD) detection, struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$ and $\pm 30^\circ$) and measures the resulting embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
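To make the stratified evaluation and the selective-prediction step concrete, here is a hedged sketch. It assumes each test case already carries a stability-group label ("G1" to "G4") and that "G4" denotes the most sensitive quartile. The names `recall_by_group` and `route`, and the choice to send only "G4" to review, are illustrative assumptions rather than details taken from the paper. Labels are used only to measure recall per group; the routing decision itself needs none.

```python
from collections import defaultdict


def recall_by_group(groups, y_true, y_pred):
    """Recall = TP / (TP + FN), computed separately per stability group.

    Groups with no positive cases are omitted, since recall is
    undefined for them.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        if truth == 1:
            positives[group] += 1
            true_pos[group] += int(pred == 1)
    return {g: true_pos[g] / positives[g] for g in sorted(positives)}


def route(group, review_groups=("G4",)):
    """Label-free triage: flag unstable cases, auto-accept the rest.

    Assumption: only the most sensitive quartile is sent to review.
    """
    return "clinician_review" if group in review_groups else "auto_accept"
```

Comparing the recall of the most sensitive group against the others is one way to surface the kind of gap the abstract reports, which an aggregate AUROC alone would not show.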
