Toward Clinically Trustworthy Deep Learning: Applying Conformal Prediction to Intracranial Hemorrhage Detection
Cooper Gamble, Shahriar Faghani, Bradley J. Erickson
TL;DR
This study addresses the trust gap in deep learning (DL) for radiology by applying Mondrian conformal prediction (MCP) to intracranial hemorrhage (ICH) detection on head CTs. A YOLOv8-based detector was trained on definite, radiologist-consensus data, calibrated with MCP, and evaluated on test, challenging, negative control, and RSNA external validation sets. MCP yielded prediction sets with statistical coverage guarantees and identified challenging cases with 99.7% accuracy while maintaining competitive detection performance. The approach shows that an uncertainty-aware DL model can match state-of-the-art accuracy and flag uncertain inputs for expert review, which supports practical deployment in radiology workflows. The work also provides an open, deployable MCP toolkit and suggests future extensions to 3D models and broader validation to support clinically trustworthy AI adoption.
Abstract
As deep learning (DL) continues to demonstrate its ability in radiological tasks, it is critical that clinical DL solutions are designed with safety in mind. One of the principal concerns in the clinical adoption of DL tools is trust. This study applies conformal prediction as a step toward trustworthy DL in radiology. This is a retrospective study of 491 non-contrast head CTs from the CQ500 dataset, in which three senior radiologists annotated slices containing intracranial hemorrhage (ICH). The dataset was split into definite and challenging subsets, where challenging images were defined as those on which the readers disagreed. A DL model was trained on 146 patients (10,815 slices) from the definite data (training dataset) to perform ICH localization and classification for five classes of ICH. To develop an uncertainty-aware DL model, 1,546 cases of the definite data (calibration dataset) were used for Mondrian conformal prediction (MCP). The uncertainty-aware DL model was tested on 8,401 definite and challenging cases to assess its ability to identify challenging cases. After the MCP procedure, the model achieved an F1 score of 0.920 for ICH classification on the test dataset. Additionally, it correctly identified 6,837 of the 6,856 total challenging cases as challenging (99.7% accuracy). It did not incorrectly label any definite cases as challenging. The uncertainty-aware ICH detector performs on par with state-of-the-art models. MCP's performance in detecting challenging cases demonstrates that it is useful in automated ICH detection and promising for trustworthiness in radiological DL.
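To make the calibration step concrete, the following is a minimal sketch of class-conditional (Mondrian) conformal prediction. It is not the authors' implementation. The nonconformity score (one minus the predicted probability of the candidate class), the rule that flags a case as challenging when its prediction set does not contain exactly one label, the function names, and the toy two-class data are all assumptions made for illustration.

```python
from collections import defaultdict


def calibrate(cal_probs, cal_labels):
    """Group calibration nonconformity scores by true class (the Mondrian step).

    Assumed score: 1 - predicted probability of the true class.
    """
    scores = defaultdict(list)
    for probs, label in zip(cal_probs, cal_labels):
        scores[label].append(1.0 - probs[label])
    return dict(scores)


def p_value(class_scores, score):
    """Smoothed share of same-class calibration scores at least as large as `score`."""
    n_at_least = sum(1 for s in class_scores if s >= score)
    return (n_at_least + 1) / (len(class_scores) + 1)


def prediction_set(cal_scores, probs, epsilon):
    """Return every label whose class-conditional p-value exceeds epsilon."""
    return {
        label
        for label, class_scores in cal_scores.items()
        if p_value(class_scores, 1.0 - probs[label]) > epsilon
    }


def is_challenging(pred_set):
    """Assumed flagging rule: an empty or multi-label set is sent for expert review."""
    return len(pred_set) != 1


# Toy calibration data: label 0 = no ICH, label 1 = ICH (hypothetical values).
cal_probs = [
    [0.90, 0.10], [0.80, 0.20], [0.95, 0.05], [0.70, 0.30],
    [0.10, 0.90], [0.20, 0.80], [0.05, 0.95], [0.30, 0.70],
]
cal_labels = [0, 0, 0, 0, 1, 1, 1, 1]
cal_scores = calibrate(cal_probs, cal_labels)

confident = prediction_set(cal_scores, [0.97, 0.03], epsilon=0.25)
ambiguous = prediction_set(cal_scores, [0.50, 0.50], epsilon=0.25)
print(confident, is_challenging(confident))
print(ambiguous, is_challenging(ambiguous))
```

Because p-values are computed separately for each class, the error rate is controlled per class rather than only on average, which matters when some ICH classes are rare. A smaller epsilon produces larger prediction sets, so the significance level trades off how often the true label is covered against how many cases are flagged for review.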
