Enhancing Multi-Label Thoracic Disease Diagnosis with Deep Ensemble-Based Uncertainty Quantification

Yasiru Laksara, Uthayasanker Thayasivam

TL;DR

This work tackles the inadequacy of deterministic predictions in high-stakes thoracic disease diagnosis by embedding uncertainty quantification into a multi-label chest X-ray classifier. A two-path development process first tested Monte Carlo Dropout but found it degraded performance and calibration, prompting a shift to a diverse 9-member Deep Ensemble that combines DenseNet, EfficientNet, and CBAM-based backbones with specialized loss functions. The Deep Ensemble achieved state-of-the-art average AUROC (0.8559) and strong calibration (ECE 0.0728, NLL 0.1916), while enabling a decomposition of total uncertainty into Aleatoric and Epistemic components (mean EU 0.0240). Ensemble Grad-CAM visualizations provide interpretable, consensus heatmaps, supporting clinical trust and decision-making. The study highlights that data uncertainty dominates remaining errors, suggesting future gains require higher-quality, clinician-validated data and external validation across diverse datasets to confirm generalizability and reliability in real-world use.
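The split of total uncertainty into Aleatoric and Epistemic parts can be illustrated with a common entropy-based formulation for ensembles. This is a minimal sketch, not necessarily the paper's exact definition: it assumes each of the ensemble members emits an independent sigmoid probability per label, takes total uncertainty as the entropy of the averaged probability, takes aleatoric uncertainty as the average per-member entropy, and treats the difference as epistemic. The function names and the example probabilities are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def decompose_uncertainty(member_probs):
    """Split predictive uncertainty for one label of one image.

    member_probs: sigmoid outputs, one per ensemble member (assumed input).
    Returns (total, aleatoric, epistemic), all in nats.
    """
    n = len(member_probs)
    mean_p = sum(member_probs) / n
    total = binary_entropy(mean_p)                             # entropy of the mean
    aleatoric = sum(binary_entropy(p) for p in member_probs) / n  # mean of entropies
    epistemic = total - aleatoric                              # member disagreement
    return total, aleatoric, epistemic


# Hypothetical outputs of a 9-member ensemble for a single label.
agree = [0.30] * 9
disagree = [0.05, 0.95, 0.10, 0.90, 0.20, 0.80, 0.15, 0.85, 0.50]

for name, probs in (("agree", agree), ("disagree", disagree)):
    total, aleatoric, epistemic = decompose_uncertainty(probs)
    print(f"{name}: total={total:.4f} aleatoric={aleatoric:.4f} epistemic={epistemic:.4f}")
```

When all members agree, the epistemic term is essentially zero and any remaining uncertainty is attributed to the data. When members disagree, the epistemic term grows, which is the signal a clinician could use to flag a prediction for review.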

Abstract

The utility of deep learning models such as CheXNet in high-stakes clinical settings is fundamentally constrained by their purely deterministic nature: they provide no reliable measure of predictive confidence. This project addresses this gap by integrating Uncertainty Quantification (UQ) into a high-performance diagnostic platform for 14 common thoracic diseases on the NIH ChestX-ray14 dataset. Initial development with Monte Carlo Dropout (MCD) failed to stabilize performance and calibration, yielding an unacceptable Expected Calibration Error (ECE) of 0.7588. This failure necessitated an architectural pivot to a high-diversity, 9-member Deep Ensemble (DE). The resulting DE stabilized performance and delivered superior reliability, achieving a State-of-the-Art (SOTA) average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.8559 and an average F1 Score of 0.3857. Crucially, the DE demonstrated superior calibration (Mean ECE of 0.0728 and Negative Log-Likelihood (NLL) of 0.1916) and enabled the decomposition of total uncertainty into its Aleatoric (irreducible data noise) and Epistemic (reducible model knowledge) components, with a mean Epistemic Uncertainty (EU) of 0.0240. These results establish the Deep Ensemble as a trustworthy and explainable platform, transforming the model from a probabilistic tool into a reliable clinical decision support system.
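To make the calibration figures concrete, the sketch below computes a binned Expected Calibration Error for a single label. It is an illustration under stated assumptions rather than the authors' evaluation code: it uses ten equal-width bins, compares the mean predicted probability in each bin against the observed positive rate, and would be averaged over the 14 labels to obtain a mean ECE. The function name, bin count, and toy inputs are hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for one label.

    probs:  predicted positive probabilities in [0, 1].
    labels: ground-truth values (0 or 1), same length as probs.
    Returns the sample-weighted mean gap between the observed positive
    rate and the mean predicted probability across equal-width bins.
    """
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(pos_rate - mean_conf)
    return ece


# Toy data: predictions of 0.8 that are right 80% of the time are well calibrated,
# while predictions of 0.9 that are never right are badly overconfident.
calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
overconfident = expected_calibration_error([0.9] * 10, [0] * 10)
print(f"calibrated: {calibrated:.4f}, overconfident: {overconfident:.4f}")
```

Read against this definition, an ECE near 0.76 means predicted probabilities were off from observed frequencies by roughly three quarters on average, while an ECE near 0.07 indicates a much closer match.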

Paper Structure

This paper contains 24 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the methodology pipeline
  • Figure 2: Ensemble Model - Combined ROC Curves
  • Figure 3: Ensemble Model - Calibration Analysis (ECE)
  • Figure 4: Example Ensemble Grad-CAM Visualization for Mass disease prediction