Evaluating the Impact of Radiographic Noise on Chest X-ray Semantic Segmentation and Disease Classification Using a Scalable Noise Injection Framework

Derek Jiu, Kiran Nijjer, Nishant Chinta, Ryan Bui, Kevin Zhu

TL;DR

Chest X-ray AI systems face reliability challenges from radiographic noise, which this work addresses with a scalable, noise-type–aware injection framework that simulates Poisson (quantum) and Gaussian (electronic) noise across a calibrated severity ladder. The authors systematically evaluate both semantic segmentation and disease classification on Landmark and NIH ChestX-ray14, revealing a stark task-dependent dichotomy: segmentation is highly brittle under noise, especially electronic perturbations, whereas classification exhibits greater resilience but with distinct, task-specific vulnerabilities to each noise type. The study provides a reproducible benchmark and cross-task insights, showing non-monotonic degradation patterns potentially due to noise acting as a regularizer at intermediate levels. These findings inform the design of validation and mitigation strategies for safe clinical deployment of diagnostic AI in chest radiography, bridging the gap between idealized training conditions and real-world imaging variability.
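To make the two noise models concrete, the following is a minimal sketch of how signal-dependent Poisson noise and signal-independent Gaussian noise could be injected into an image normalized to [0, 1]. It is not the authors' framework: the function names, the `max_photons` and `sigma_per_step` parameters, and the mapping from severity to noise strength are all assumptions chosen for illustration, and the paper's calibrated severity ladder may be defined differently.

```python
import numpy as np


def add_quantum_noise(image, severity, rng, max_photons=1000.0):
    """Poisson (quantum) noise on an image with values in [0, 1].

    Hypothetical mapping: the simulated photon budget shrinks as
    severity grows, so the noise gets stronger and scales with signal.
    """
    if severity <= 0:
        return image.copy()
    photons = max_photons / severity  # assumed severity mapping
    noisy = rng.poisson(image * photons) / photons
    return np.clip(noisy, 0.0, 1.0)


def add_electronic_noise(image, severity, rng, sigma_per_step=0.02):
    """Additive zero-mean Gaussian (electronic) noise.

    Hypothetical mapping: the standard deviation grows linearly
    with severity and does not depend on the pixel value.
    """
    sigma = sigma_per_step * severity  # assumed severity mapping
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


rng = np.random.default_rng(0)
image = np.full((64, 64), 0.5)  # stand-in for a normalized radiograph
quantum = add_quantum_noise(image, severity=10, rng=rng)
electronic = add_electronic_noise(image, severity=10, rng=rng)
```

In this sketch a pixel with zero signal stays at zero under the Poisson model but is perturbed under the Gaussian model, which is the signal-dependent versus uniform distinction the summary draws between quantum and electronic noise.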

Abstract

Deep learning models are increasingly used for radiographic analysis, but their reliability is challenged by the stochastic noise inherent in clinical imaging. A systematic, cross-task understanding of how different noise types impact these models is lacking. Here, we evaluate the robustness of state-of-the-art convolutional neural networks (CNNs) to simulated quantum (Poisson) and electronic (Gaussian) noise in two key chest X-ray tasks: semantic segmentation and pulmonary disease classification. Using a novel, scalable noise injection framework, we applied controlled, clinically motivated noise severities to common architectures (UNet, DeepLabV3, FPN; ResNet, DenseNet, EfficientNet) on public datasets (Landmark, ChestX-ray14). Our results reveal a stark dichotomy in task robustness. Semantic segmentation models proved highly vulnerable, with lung segmentation performance collapsing under severe electronic noise (Dice Similarity Coefficient drop of 0.843), signifying a near-total model failure. In contrast, classification tasks demonstrated greater overall resilience, but this robustness was not uniform. We discovered a differential vulnerability: certain tasks, such as distinguishing Pneumothorax from Atelectasis, failed catastrophically under quantum noise (AUROC drop of 0.355), while others were more susceptible to electronic noise. These findings demonstrate that while classification models possess a degree of inherent robustness, pixel-level segmentation tasks are far more brittle. The task- and noise-specific nature of model failure underscores the critical need for targeted validation and mitigation strategies before the safe clinical deployment of diagnostic AI.
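For readers unfamiliar with the segmentation metric, the sketch below shows how a Dice Similarity Coefficient and a drop between clean and noisy predictions could be computed on binary masks. The function names and the toy masks are hypothetical; the code illustrates the standard definition, $\mathrm{DSC} = 2|A \cap B| / (|A| + |B|)$, and does not reproduce the reported values or the authors' evaluation pipeline.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # convention assumed here: two empty masks agree
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / denom)


def dice_drop(pred_clean, pred_noisy, target):
    """Loss in Dice when moving from clean to noisy input."""
    return dice_coefficient(pred_clean, target) - dice_coefficient(pred_noisy, target)


# Toy example: a square "lung" mask, a perfect prediction on the clean
# image, and a prediction on the noisy image that covers only half of it.
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True
pred_clean = target.copy()
pred_noisy = np.zeros_like(target)
pred_noisy[2:6, 2:4] = True

drop = dice_drop(pred_clean, pred_noisy, target)
```

Because Dice is bounded between 0 and 1, a drop as large as the one reported in the abstract leaves very little overlap between the predicted and reference masks.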

Paper Structure

This paper contains 38 sections, 16 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: An original chest radiograph (left) is shown with the addition of maximum-intensity quantum noise ($s_q = 10$, middle) and electronic noise ($s_e = 10$, right) to demonstrate their distinct visual textures. Quantum noise produces a blotchy, signal-dependent mottle, while electronic noise introduces a fine-grained, uniform static.
  • Figure 2: Average classification performance (AUPRC, AUROC, F1) is plotted against quantum ($s_q$) and electronic ($s_e$) noise severity. Two representative disease pairs from each category (visually similar and visually distinct) are shown to illustrate the differential impact of noise, where model degradation is highly dependent on both the clinical task and the noise modality. While all three metrics are displayed for completeness, the discussion in the main text focuses on AUROC as the primary metric.