Robustness Testing of Black-Box Models Against CT Degradation Through Test-Time Augmentation
Jack Highton, Quok Zong Chong, Samuel Finestone, Arian Beqiri, Julia A. Schnabel, Kanwal K. Bhatia
TL;DR
This work tackles the problem of validating the robustness of black-box medical imaging models to CT quality degradation, addressing real-world distribution shifts that arise from scanner changes and artifacts. It introduces a physics-based CT simulation framework that generates acquisition-based degradations (noise, metal streaks, motion) as test-time augmentations, enabling model-agnostic robustness testing with a small, locally annotated dataset. The study evaluates two segmentation models (3D-UNet, nnUNet) and one detection model (RetinaNet), converting segmentations to detection outputs so that mAP can be compared fairly across them. The results reveal architecture-specific robustness: nnUNet is generally more robust to CT noise and streaks, RetinaNet is more robust for nodule detection, and all models are susceptible to metal-induced artifacts. The framework also yields degradation-weighted metrics and reports the variability (standard deviation) of performance, which offers practical guidance for deployment and ongoing monitoring in clinical settings. Overall, the approach provides a scalable, site-specific benchmarking tool for assessing and comparing the robustness of CT-based medical imaging models, and it supports safer, more trustworthy adoption of deep learning in clinical workflows.
Abstract
Deep learning models for medical image segmentation and object detection are becoming increasingly available as clinical products. However, because details about the training data are rarely provided, models may fail unexpectedly when cases differ from those in the training distribution. An approach that allows potential users to independently test the robustness of a model, treating it as a black box and using only a few cases from their own site, is key for adoption. To address this, we present a method to test the robustness of these models against variation in CT image quality. We demonstrate the framework by showing that, given the same training data, the model architecture and data preprocessing greatly affect the robustness of several frequently used segmentation and object detection methods to simulated CT imaging artifacts and degradation. Our framework also addresses concerns about the sustainability of deep learning models in clinical use, by considering future shifts in image quality due to scanner deterioration or imaging protocol changes that are not reflected in a limited local test dataset.
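To make the black-box testing idea concrete, the following is a minimal sketch of how a site might probe a model using only a few annotated cases. It is not the authors' implementation. The names `add_gaussian_noise`, `robustness_profile`, and `threshold_model` are hypothetical, images are represented as flat lists of intensities for simplicity, and additive image-domain Gaussian noise is used purely as a stand-in for the physics-based CT degradations described above. The sketch reports the mean and standard deviation of the Dice score at each degradation severity.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[float]  # flattened intensities (toy representation, an assumption)
Mask = List[int]     # flattened binary segmentation mask


def add_gaussian_noise(image: Sequence[float], sigma: float,
                       rng: random.Random) -> Image:
    """Stand-in degradation: additive image-domain Gaussian noise.

    This is NOT a physics-based CT simulation; it only marks the place
    where a realistic acquisition-based degradation would be applied.
    """
    return [v + rng.gauss(0.0, sigma) for v in image]


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice overlap between two binary masks of equal length."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total


def robustness_profile(
    model: Callable[[Image], Mask],
    cases: Sequence[Tuple[Image, Mask]],
    severities: Sequence[float],
    repeats: int = 5,
    seed: int = 0,
) -> Dict[float, Tuple[float, float]]:
    """Map each severity to (mean Dice, standard deviation of Dice).

    The model is only ever called on inputs, so it is treated as a black box.
    """
    rng = random.Random(seed)
    profile = {}
    for sigma in severities:
        scores = [
            dice(model(add_gaussian_noise(image, sigma, rng)), truth)
            for image, truth in cases
            for _ in range(repeats)
        ]
        profile[sigma] = (statistics.mean(scores), statistics.pstdev(scores))
    return profile


def threshold_model(image: Image) -> Mask:
    """Toy black-box 'segmentation model': a fixed intensity threshold."""
    return [1 if v > 0.5 else 0 for v in image]


# One toy case: the clean image equals its own ground-truth mask.
truth = [0] * 50 + [1] * 50
image = [float(t) for t in truth]
profile = robustness_profile(threshold_model, [(image, truth)], [0.0, 0.2, 1.0])

for sigma, (mean_dice, std_dice) in profile.items():
    print(f"sigma={sigma:.1f}  Dice mean={mean_dice:.3f}  std={std_dice:.3f}")
```

In practice, `threshold_model` would be replaced by a call to the deployed model, and the noise function by a simulated degradation of the kind the framework provides. Reporting the standard deviation alongside the mean reflects the emphasis on performance variability rather than average performance alone.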
