Implementing Trust in Non-Small Cell Lung Cancer Diagnosis with a Conformalized Uncertainty-Aware AI Framework in Whole-Slide Images

Xiaoge Zhang; Tao Wang; Chao Yan; Fedaa Najdawi; Kai Zhou; Yuan Ma; Yiu-ming Cheung; Bradley A. Malin

TL;DR

An AI model wrapped with TRUECAM significantly outperforms models that lack such guidance in classification accuracy, robustness, interpretability, and data efficiency, while also improving fairness.

Abstract

Ensuring trustworthiness is fundamental to the development of artificial intelligence (AI) that is considered societally responsible, particularly in cancer diagnostics, where a misdiagnosis can have dire consequences. Current digital pathology AI models lack systematic solutions to address trustworthiness concerns arising from model limitations and data discrepancies between model deployment and development environments. To address this issue, we developed TRUECAM, a framework designed to ensure both data and model trustworthiness in non-small cell lung cancer subtyping with whole-slide images. TRUECAM integrates 1) a spectral-normalized neural Gaussian process for identifying out-of-scope inputs and 2) an ambiguity-guided elimination of tiles to filter out highly ambiguous regions, addressing data trustworthiness, as well as 3) conformal prediction to ensure controlled error rates. We systematically evaluated the framework across multiple large-scale cancer datasets, leveraging both task-specific and foundation models, and illustrate that an AI model wrapped with TRUECAM significantly outperforms models that lack such guidance in classification accuracy, robustness, interpretability, and data efficiency, while also achieving improvements in fairness. These findings highlight TRUECAM as a versatile wrapper framework for digital pathology AI models with diverse architectural designs, promoting their responsible and effective application in real-world settings.
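
To make the third component concrete, the following is a minimal sketch of split conformal prediction for a two-class subtyping problem (for example LUAD vs. LUSC). It is not the authors' implementation: the nonconformity score (one minus the probability of the true class), the function names `conformal_threshold` and `prediction_sets`, and the toy inputs are all assumptions chosen for illustration. The sketch only shows how a calibration set and a target error level $\alpha$ yield prediction sets whose size signals a definitive answer (one class) or an abstention (zero or two classes).

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Return the split-conformal threshold for scores 1 - p(true class).

    cal_probs: (n, K) array of predicted class probabilities (hypothetical input).
    cal_labels: (n,) array of integer true labels.
    alpha: target error level; the aim is true-label coverage of 1 - alpha.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Rank of the order statistic with the finite-sample correction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every class.
        return float("inf")
    return float(np.sort(scores)[k - 1])

def prediction_sets(test_probs, qhat):
    """Boolean (m, K) mask: class k is in the set when 1 - p_k <= qhat."""
    return (1.0 - test_probs) <= qhat

# Toy calibration data (assumed values, not from the paper).
cal_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]])
cal_labels = np.array([0, 0, 1, 0])
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

test_probs = np.array([[0.95, 0.05], [0.5, 0.5]])
sets = prediction_sets(test_probs, qhat)
set_sizes = sets.sum(axis=1)  # size 1: definitive answer; 0 or 2: abstain
```

In this sketch a set of size one would be reported as a definitive answer, while an empty set or a set containing both subtypes would be treated as an abstention and referred for review. The coverage guarantee relies on calibration and test data being exchangeable, which is one reason out-of-domain inputs need to be screened separately.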

Paper Structure (25 sections, 10 equations, 16 figures)

Main
Results
Discussions
Method
Data Availability
Code availability
Acknowledgments
Author contributions statement

Figures (16)

Figure 1: Overview of TRUECAM. TRUECAM is a versatile, model-agnostic digital pathology AI framework for reliable non-small-cell lung cancer (NSCLC) subtyping, achieving trustworthiness by substantially reducing errors, detecting OOD data and controlling distribution shifts pre-inference, identifying and eliminating ambiguous slide regions, and ensuring true label coverage with statistical guarantees via abstention on uncertain inputs. a, The architecture of TRUECAM designed to ensure both data and model trustworthiness. b, Customization illustration of TRUECAM for deep learning models of varying architecture, complexity, and purpose, including Inception-v3, UNI, and CONCH. c, Illustration of eliminating ambiguous tiles in TRUECAM for slide inference. d, Overview of the TCGA and CPTAC NSCLC datasets, as well as the out-of-domain dataset built from other cancer types within TCGA. e, TRUECAM significantly reduced NSCLC subtyping error rates across all model types (denoted using suffix "-TRUECAM") compared to their original deterministic versions (denoted using suffix "-D"), adhering to the pre-specified true label coverage 1-$\alpha$. f, Patient-level classification breakdown for models with and without TRUECAM. g, TRUECAM achieved significantly lower error rates in real-world NSCLC subtyping scenarios involving a 1:1 mix of in-domain and out-of-domain inputs. Results shown in e-g are based on the TCGA dataset. See Extended Data Fig. 1 for evaluations based on the CPTAC dataset and two other foundation models (Prov-GigaPath and TITAN). All mean values and 95% confidence intervals are based on 20 independently trained models, each with 500 conformal prediction evaluations.
OOD, out-of-domain; In-D, in-domain; SNGP, spectral-normalized neural Gaussian process; EAT, elimination of ambiguous tiles; WSI, whole-slide image; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; BLCA, bladder urothelial carcinoma; USC, uterine carcinosarcoma; UVM, uveal melanoma; ACC, adrenocortical carcinoma; Incep, Inception-v3; D, Deterministic.
Extended Data Figure 1: Evaluation of TRUECAM's performance across four foundation models: UNI, CONCH, Prov-GigaPath, and TITAN. a, d, g, TRUECAM significantly reduced NSCLC subtyping error rates across model types (denoted using suffix "-TRUECAM") compared to their original deterministic versions (denoted using suffix "-D"), adhering to the pre-specified true label coverage 1-$\alpha$. The only exception is TITAN-TRUECAM with $1-\alpha=0.95$, where TITAN-D achieved an error rate below 0.05. As a result, CP produced a zero prediction set size (no prediction) to maintain the desired coverage of 0.95. b, e, h, Patient-level classification breakdown for models with and without TRUECAM. c, f, i, TRUECAM achieved significantly lower error rates in real-world NSCLC subtyping scenarios involving a 1:1 mix of in-domain and out-of-domain inputs. Evaluations in a-c and g-i are based on the CPTAC testing data, whereas those in d-f are based on the TCGA testing data. Results shown in a-c are based on 20 independently trained models, each with 500 conformal prediction evaluations, and those in d-i are based on 20 independently trained models. OOD, out-of-domain; D, Deterministic.
Figure 2: NSCLC subtyping performance of three Inception-v3-based deep neural network models and their conformalized counterparts. a, Tile-level performance on the TCGA dataset in terms of classification accuracy and area under the receiver operating characteristic curve (AUROC). b, Tile-level performance evaluated on the CPTAC dataset. c, Patient-level performance evaluated on the TCGA dataset. d, Patient-level performance evaluated on the CPTAC dataset. e, Tile-level prediction set size on TCGA for three distinct values of error level $\alpha$. f, Tile-level prediction set size on CPTAC. g, Patient-level prediction set size on TCGA. h, Patient-level prediction set size on CPTAC. i, Patient-level classification breakdown for three conformalized models on the TCGA testing dataset. j, Patient-level classification breakdown on CPTAC. k, Patient-level DA error rate on TCGA before and after activating CP. l, Patient-level DA error rate on CPTAC before and after activating CP. Mean values and the corresponding 95% confidence intervals in a-d are derived from 20 independently trained models, evaluated using a combination of calibration and testing data. Those in e-l are based on 20 independently trained models, each with 500 CP evaluations (see details in the Methods). A one-sided Wilcoxon signed-rank test is used to calculate p values. *** $p<0.001$; ** $p<0.01$; * $p<0.05$. SNGP, spectral-normalized neural Gaussian process; CP, conformal prediction; DA, definitive-answer.
Extended Data Figure 2: Impact assessment of EAT on classification and CP performance. a, Tile-level classification performance on TCGA. b, Tile-level classification performance on CPTAC. c, Patient-level classification performance on TCGA. d, Patient-level classification performance on CPTAC. e, Tile-level prediction set size on TCGA for three distinct values of significance level $\alpha$. f, Tile-level prediction set size on CPTAC. g, Patient-level prediction set size on TCGA. h, Patient-level prediction set size on CPTAC. i, Patient-level DA error rate on TCGA before and after activating CP. j, Patient-level DA error rate on CPTAC before and after activating CP. Mean values and the corresponding 95% confidence intervals in a-j are based on 20 independently trained models, each with 500 CP evaluations (See details in the Methods). One-sided Wilcoxon tests are utilized to calculate p values. *** $p<0.001$. SNGP, spectral-normalized neural Gaussian process; RE, random elimination; EAT, elimination of ambiguous tiles; DA, definitive-answer.
...and 11 more figures
