Deep Modeling and Interpretation for Bladder Cancer Classification

Ahmad Chaddad, Yihang Wu, Xianrui Chen

TL;DR

This work tackles bladder cancer image classification by benchmarking 13 deep models (CNNs and vision transformers) across five optimizers on a multicenter dataset, with a focus on accuracy, calibration (ECE), and GradCAM++-based interpretability. The results show strong in-domain performance but notable cross-center generalization gaps: ConvNeXt models often underperform on out-of-distribution (OOD) data, while ViTs offer stronger calibration and interpretability in such settings. Test-time augmentation improves explanations for some models, but no model robustly handles all scenarios, highlighting the need for domain adaptation. The findings guide model selection for clinical deployment by contrasting in-distribution versus out-of-distribution requirements and emphasize calibration and interpretability as critical for real-world use.

Abstract

Deep models based on vision transformers (ViTs) and convolutional neural networks (CNNs) have demonstrated remarkable performance on natural image datasets. However, these models may not perform as well in medical imaging, where abnormal regions cover only a small portion of the image. This challenge motivates us to investigate the latest deep models for bladder cancer classification. We evaluate these models in three ways: 1) standard classification using 13 models (four CNNs and nine transformer-based models), 2) calibration analysis to examine whether these models are well calibrated for bladder cancer classification, and 3) GradCAM++ to evaluate the interpretability of these models for clinical diagnosis. We run $\sim 300$ experiments on a publicly available multicenter bladder cancer dataset, and the results show that the ConvNeXt series has limited generalization ability when classifying bladder cancer images (e.g., $\sim 60\%$ accuracy). In addition, ViTs are better calibrated than the ConvNeXt and Swin Transformer series. We also apply test-time augmentation to improve the models' interpretability. Finally, no model provides a one-size-fits-all solution for interpretability: the ConvNeXt series is suitable for in-distribution samples, while ViT and its variants are suitable for interpreting out-of-distribution samples.
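
The abstract mentions test-time augmentation as a way to improve interpretability but does not say which augmentations are used. The following is a minimal sketch of one common form, assuming a single horizontal flip: the saliency map is computed on each view, mapped back to the original orientation, and averaged. The name `tta_saliency` and the `saliency_fn` argument are hypothetical; `saliency_fn` stands in for any explainer (for example a GradCAM++ wrapper) that maps an image array to a 2D heatmap.

```python
import numpy as np


def tta_saliency(saliency_fn, image):
    """Average saliency maps over the original view and a horizontal flip.

    saliency_fn: hypothetical callable, image (H, W[, C]) -> heatmap (H, W).
    The choice of augmentations here is an assumption, not the paper's setup.
    """
    # Each entry pairs an augmentation with the inverse applied to the heatmap.
    views = [
        (lambda x: x, lambda m: m),                     # identity view
        (lambda x: x[:, ::-1], lambda m: m[:, ::-1]),   # horizontal flip
    ]
    aligned_maps = []
    for augment, undo in views:
        heatmap = saliency_fn(augment(image))
        # Map the heatmap back to the original image orientation.
        aligned_maps.append(undo(heatmap))
    return np.mean(aligned_maps, axis=0)


def left_biased_explainer(x):
    """Toy explainer that always highlights the left-most column."""
    heatmap = np.zeros(x.shape[:2])
    heatmap[:, 0] = 1.0
    return heatmap


toy_image = np.arange(6.0).reshape(2, 3)
averaged = tta_saliency(left_biased_explainer, toy_image)
```

In the toy case, the explainer's position bias is spread evenly over the left and right columns after averaging, which illustrates why aggregating over views can smooth out artifacts that do not depend on image content.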

Paper Structure (14 sections, 9 figures, 1 table)

This paper contains 14 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Flowchart of the comparative analysis workflow for the 13 models. The pipeline begins with image collection, followed by data preprocessing and model training. Finally, performance metrics are computed for each model to determine the most effective model in different situations.
  • Figure 2: Example of MIBC and NMIBC samples in each center.
  • Figure 3: Spider plots of test metrics for 13 deep models on the Bladder dataset using SGD (first row), Adam (second row), AdamW (third row), Adagrad (fourth row), and Adadelta (fifth row).
  • Figure 4: Line chart of validation metrics using 13 deep models on Bladder dataset with SGD optimizer.
  • Figure 5: Reliability plot with ECE values for 13 deep models on Bladder dataset using Adagrad optimizer. The first row shows the results of CN_b, CN_l, CN_s, CN_t and Maxvit_t, the second row indicates the results of Swin_b, Swin_s, Swin_t, Swin_v2_b and Swin_v2_s, while the last row represents the results of Swin_v2_t, Vit_l_16 and Vit_h_14.
  • ...and 4 more figures
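
The reliability plots in Figure 5 report ECE (expected calibration error) for each model. As a rough illustration of how such a value is typically computed, here is a minimal sketch using the standard equal-width binning definition. The function name, the default of 10 bins, and the toy inputs are assumptions and are not taken from the paper.

```python
import numpy as np


def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: max softmax probability per sample, in (0, 1].
    n_bins=10 is an assumed default, not a value stated in the paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        # Bins are half-open on the left: (lower, upper].
        in_bin = (confidences > lower) & (confidences <= upper)
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        # Weight each bin by the fraction of samples it contains.
        ece += in_bin.mean() * gap
    return ece


# Toy two-class example with four samples (values are illustrative only).
toy_ece = expected_calibration_error(
    confidences=[0.95, 0.95, 0.55, 0.55],
    predictions=[1, 1, 0, 0],
    labels=[1, 1, 0, 1],
)
```

A lower value means the model's confidence tracks its accuracy more closely. A model whose 75%-confident predictions are right 75% of the time contributes zero error for that bin.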