Robustness Testing of Black-Box Models Against CT Degradation Through Test-Time Augmentation

Jack Highton; Quok Zong Chong; Samuel Finestone; Arian Beqiri; Julia A. Schnabel; Kanwal K. Bhatia

TL;DR

This work addresses how to validate the robustness of black-box medical imaging models to CT quality degradation, a real-world distribution shift that arises from scanner changes and artifacts. It introduces a physics-based CT simulation framework that generates acquisition-based degradations (noise, metal streaks, motion) as test-time augmentations, enabling model-agnostic robustness testing with a small, locally annotated dataset. The study evaluates two segmentation models (3D-UNet, nnUNet) and one detection model (RetinaNet), converting segmentations to detection outputs so that mAP can be compared fairly across them. The results show architecture-specific robustness: nnUNet is generally more robust to CT noise and streaks, RetinaNet is more robust for nodule detection, and all models are susceptible to metal-induced artifacts. The framework also yields degradation-weighted metrics and reports the variability (standard deviation) of performance, which gives practical guidance for deployment and ongoing monitoring in clinical settings. Overall, the approach provides a scalable, site-specific benchmarking tool for assessing and comparing the robustness of CT-based medical imaging models, supporting safer adoption of deep learning in clinical workflows.
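The conversion of segmentations to detection outputs mentioned above can be illustrated with a minimal sketch. It assumes the common approach of taking one bounding box per connected component of the predicted mask; the function name `mask_to_boxes`, the 6-connectivity, and the use of the mean voxel probability as the detection score are illustrative assumptions, not details confirmed by the paper.

```python
from collections import deque


def mask_to_boxes(mask, prob=None):
    """Turn a binary 3D mask (nested lists indexed [z][y][x]) into boxes.

    Hypothetical helper: one box per 6-connected component.
    Each result has 'box' = (z0, y0, x0, z1, y1, x1), inclusive bounds,
    and 'score' = mean of `prob` over the component (1.0 if prob is None).
    """
    nz, ny, nx = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    detections = []
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if not mask[z][y][x] or (z, y, x) in seen:
                    continue
                # Breadth-first flood fill collects one connected component.
                queue = deque([(z, y, x)])
                seen.add((z, y, x))
                voxels = []
                while queue:
                    cz, cy, cx = queue.popleft()
                    voxels.append((cz, cy, cx))
                    for dz, dy, dx in steps:
                        n = (cz + dz, cy + dy, cx + dx)
                        inside = 0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                        if inside and n not in seen and mask[n[0]][n[1]][n[2]]:
                            seen.add(n)
                            queue.append(n)
                zs, ys, xs = zip(*voxels)
                if prob is None:
                    score = 1.0
                else:
                    # Assumed scoring rule: average probability in the component.
                    score = sum(prob[a][b][c] for a, b, c in voxels) / len(voxels)
                detections.append({
                    "box": (min(zs), min(ys), min(xs), max(zs), max(ys), max(xs)),
                    "score": score,
                })
    return detections
```

The resulting boxes and scores can then be passed to any standard mAP implementation alongside the outputs of a native detector. A production version would more likely use an array library for speed; plain lists are used here only to keep the sketch dependency-free.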

Abstract

Deep learning models for medical image segmentation and object detection are becoming increasingly available as clinical products. However, as details are rarely provided about the training data, models may unexpectedly fail when cases differ from those in the training distribution. An approach allowing potential users to independently test the robustness of a model, treating it as a black box and using only a few cases from their own site, is key for adoption. To address this, a method to test the robustness of these models against CT image quality variation is presented. We demonstrate the framework by showing that, given the same training data, the model architecture and data preprocessing greatly affect the robustness of several frequently used segmentation and object detection methods to simulated CT imaging artifacts and degradation. Our framework also addresses the concern about the sustainability of deep learning models in clinical use, by considering future shifts in image quality due to scanner deterioration or imaging protocol changes, which are not reflected in a limited local test dataset.

Paper Structure (34 sections, 7 equations, 18 figures, 4 tables)

Introduction
Related work
Identifying OOD images with knowledge of the model and training data
Testing robustness against OOD images from benchmark datasets
Testing robustness against images with acquisition based augmentations
Testing robustness against OOD images with limited datasets
Contributions
Materials and metrics
Datasets
LUNA 16
Segmentation decathlon - liver task
Segmentation model architectures and training
3D-UNet
nnUNet
Object detection model architectures and training
...and 19 more sections

Figures (18)

Figure 1: An overview of the proposed framework for robustness testing. The potential user of a black-box model wishes to test its robustness to CT image degradation using a small locally acquired test dataset, annotated with segmentation maps or object boxes outlined by an expert. The performance of the model is evaluated after various simulation-based augmentations are applied to the input image (center). Then the evaluation results, alongside information from the test dataset noise distribution, are used to calculate robustness metrics for the model.
Figure 2: The geometry parameters of the simulated CT system: $\phi$ is the beam angle, d1 the source-to-image-center distance, d2 the image-center-to-array distance, ddet the distance between adjacent detectors, and ndet the number of detectors.
Figure 3: Outline of the CT noise simulation tuning process. Step 1: The image intensity is converted from HU to attenuation units before denoising and noise extraction via the total variation method. Step 2: The simulation parameters Q0, $\sigma$, and n$\theta$ are selected to maximize the patchwise similarity in the radial noise power spectrum (NPS) between the noise generated by the simulation and the noise extracted from the original image, as described in Table \ref{fig:search}. Step 3: To increase the level of noise, the incident flux Q0 is decreased until a desired augmented image noise level is reached, as described in Section \ref{noise_gen}.
Figure 4: The CT noise augmentation is performed by reducing the simulation parameter Q0 to the value at which the output images have an increased noise component with a given standard deviation (s.d.).
Figure 5: The radial noise power spectrum (NPS) is sampled from the noise extracted from the original image within a central patch (left) for comparison with the same patch of the simulated noise fields. The normalized NPS curves are shown to have similar profiles (right).
...and 13 more figures
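The captions of Figures 3 and 4 describe converting HU to attenuation units and lowering the incident flux Q0 to raise the noise level. The sketch below illustrates both ideas on a single line integral rather than a full CT geometry with reconstruction. It rests on several assumptions that are not taken from the paper: photon counts follow Poisson statistics (approximated here by a Gaussian with matching mean and variance), the water attenuation value `MU_WATER` is a nominal figure, and all function names are hypothetical.

```python
import math
import random
import statistics

MU_WATER = 0.02  # assumed nominal linear attenuation of water, in 1/mm


def hu_to_mu(hu):
    """Convert Hounsfield units to linear attenuation (HU = 1000*(mu - mu_w)/mu_w)."""
    return MU_WATER * (1.0 + hu / 1000.0)


def noisy_line_integral(p, q0, rng):
    """Return a noisy estimate of the line integral p at incident flux q0.

    Expected detected counts are q0*exp(-p). Poisson counting noise is
    approximated by a Gaussian with the same mean and variance (assumption).
    """
    lam = q0 * math.exp(-p)
    counts = max(rng.gauss(lam, math.sqrt(lam)), 1.0)  # guard against log(<=0)
    return -math.log(counts / q0)


def noise_sd(p, q0, rng, n=4000):
    """Empirical s.d. of the noisy line integral over n simulated readings."""
    return statistics.pstdev(noisy_line_integral(p, q0, rng) for _ in range(n))


def q0_for_target_sd(p, sd):
    """First-order (delta-method) flux giving noise s.d. `sd`: var ~ exp(p)/q0."""
    return math.exp(p) / sd ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    for q0 in (1e5, 1e4, 1e3):
        print(f"Q0={q0:.0e}  noise s.d.={noise_sd(2.0, q0, rng):.4f}")
```

Under these assumptions the noise s.d. scales roughly as 1/sqrt(Q0), so a tenfold increase in noise needs about a hundredfold reduction in flux. In the image domain the relationship also depends on the reconstruction, which is why the framework searches for the Q0 that yields a given image noise level instead of relying on a closed-form rule.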
