Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring

Han-Jay Shu, Wei-Ning Chiu, Shun-Ting Chang, Meng-Ping Huang, Takeshi Tohyama, Ahram Han, Po-Chih Kuo

TL;DR

This work addresses hidden failures in CXR AI by proposing augmentation-sensitivity risk scoring (ASRS), a label-free framework that detects error-prone cases via representation instability under small rotations. Using RAD-DINO embeddings, ASRS computes $s(x) = \sum_{t \in \mathcal{T}} \lVert z_t - z_0 \rVert_2$ with rotations $\mathcal{T} = \{\pm 15^{\circ}, \pm 30^{\circ}\}$, and partitions test samples into quartiles G1–G4 based on validation thresholds. Across four diagnostic tasks (Cardiomegaly, Edema, Pneumothorax, Pleural Effusion) and three encoders on MIMIC-CXR-JPG, highly rotation-sensitive cases show a substantial recall deficit ($\approx 0.25$–$0.30$) despite high AUROC and confidence, exposing a hidden failure mode not detected by traditional metrics. ASRS enables selective prediction by auto-accepting stable cases and flagging unstable ones for clinician review, offering a practical path to safer, fairer deployment of medical AI in CXRs.
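The score and grouping described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the embeddings $z_0$ and $z_t$ have already been extracted by some encoder, the function names (`asrs_score`, `quartile_thresholds`, `assign_groups`) are hypothetical, and the convention that G1 holds the lowest scores and G4 the highest is an assumption not stated in the text.

```python
import numpy as np


def asrs_score(z0, z_rot):
    """s(x) = sum over rotations t of ||z_t - z_0||_2, computed per sample.

    z0:    (n, d) embeddings of the unrotated images.
    z_rot: (n, T, d) embeddings of the same images under T rotations
           (T = 4 for rotations of +/-15 and +/-30 degrees).
    """
    z0 = np.asarray(z0, dtype=float)
    z_rot = np.asarray(z_rot, dtype=float)
    # L2 distance per rotation, then sum across the T rotations.
    return np.linalg.norm(z_rot - z0[:, None, :], axis=-1).sum(axis=1)


def quartile_thresholds(val_scores):
    """Quartile cut points estimated on validation scores only."""
    return np.quantile(val_scores, [0.25, 0.5, 0.75])


def assign_groups(scores, thresholds):
    """Map scores to groups 1..4 using fixed validation thresholds.

    Assumption: group 1 = lowest sensitivity, group 4 = highest.
    """
    return np.digitize(scores, thresholds) + 1


# Usage with synthetic embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
val_z0 = rng.normal(size=(200, 16))
val_rot = val_z0[:, None, :] + rng.normal(scale=0.1, size=(200, 4, 16))
test_z0 = rng.normal(size=(50, 16))
test_rot = test_z0[:, None, :] + rng.normal(scale=0.1, size=(50, 4, 16))

thresholds = quartile_thresholds(asrs_score(val_z0, val_rot))
test_groups = assign_groups(asrs_score(test_z0, test_rot), thresholds)
```

Because the thresholds are fixed on the validation split, the test groups need not contain exactly 25% of samples each; the split only says where each test case falls relative to the validation distribution.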

Abstract

Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches -- based on confidence calibration or out-of-distribution (OOD) detection -- struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$/$\pm 30^\circ$) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
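To make the stratified evaluation and the selective-prediction idea concrete, the sketch below computes recall separately within each stability group and routes cases to automatic acceptance or clinician review. It is an illustrative assumption throughout: the helper names (`recall_by_group`, `route`), the binary-label setup, and the choice to send only group 4 to review are hypothetical and not taken from the paper.

```python
import numpy as np


def recall_by_group(y_true, y_pred, groups, n_groups=4):
    """Recall (TP / (TP + FN)) within each stability group.

    Returns NaN for a group that contains no positive cases.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    groups = np.asarray(groups)
    out = {}
    for g in range(1, n_groups + 1):
        positives = y_true & (groups == g)
        n_pos = int(positives.sum())
        if n_pos == 0:
            out[f"G{g}"] = float("nan")
        else:
            out[f"G{g}"] = float((y_pred & positives).sum() / n_pos)
    return out


def route(groups, review_groups=(4,)):
    """Label each case 'review' or 'accept' from its group alone.

    Assumption: only the most sensitive group is flagged for review.
    No ground-truth labels are needed for this step.
    """
    return np.where(np.isin(groups, review_groups), "review", "accept")


# Usage on a toy example.
y_true = [1, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1]
groups = [1, 1, 1, 4, 4]
per_group_recall = recall_by_group(y_true, y_pred, groups)
decisions = route(groups)
```

Routing depends only on the sensitivity group, which is what makes the approach label-free at deployment time; labels are needed only for the offline recall comparison.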

Paper Structure

This paper contains 11 sections, 4 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overview of the proposed methodology, illustrating the pipeline from ASRS computation using RAD-DINO embeddings, validation-anchored grouping into G1–G4, downstream evaluation on four diagnostic tasks (pneumothorax, cardiomegaly, pleural effusion, edema) with multiple encoders (RAD-DINO, ResNet50, CXR-MAE), to stratified performance and confidence analysis.