H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction

Xueyang Li, Zongren Wang, Yuliang Zhang, Zixuan Pan, Yu-Jen Chen, Nishchal Sapkota, Gelei Xu, Danny Z. Chen, Yiyu Shi

TL;DR

This work tackles bladder cancer recurrence prediction using post-operative multi-sequence MRI by introducing a curated, multi-modal dataset and a novel H-CNN-ViT architecture. The model employs Dual-Path Attention blocks that fuse global (ViT) and local (CNN) features within each modality, combined with a two-tier gating mechanism (Local and Global GAM) to balance intra- and inter-branch information across ADC, T2, DWI, and clinical data. On the proposed dataset, H-CNN-ViT achieves an AUC of $78.6\%$, outperforming strong CNN, ViT, and hybrid baselines with statistically significant improvements, and analyses validate the importance of hierarchical gated attention and multi-branch design. The authors also plan to release the dataset and code to facilitate further research in high-dimensional, heterogeneous medical imaging tasks and improve post-operative surveillance in bladder cancer care.
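For readers less familiar with the reported metric, the following is a minimal sketch of how an AUC value can be computed from predicted scores, using the pairwise (Mann-Whitney) formulation. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; they are not taken from the paper or its evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = recurrence, 0 = no recurrence.
toy_labels = [1, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.3, 0.6, 0.7, 0.8, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so the reported 78.6% means roughly four out of five recurrence/non-recurrence pairs are ranked correctly.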

Abstract

Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that enables selective weighting of features from the global (ViT) and local (CNN) paths based on contextual demands, achieving balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, so that the unique properties of each imaging sequence are captured before integration. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
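The selective weighting of ViT and CNN features described in the abstract can be illustrated with a generic gated fusion. This is a minimal sketch under stated assumptions, not the authors' Local GAM: the per-feature sigmoid gate, the function name `gated_fusion`, and the parameters `gate_weights` and `gate_bias` are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(f_vit, f_cnn, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with a per-feature sigmoid gate.

    gate_weights[i] is a row of length 2 * d applied to the concatenation
    [f_vit, f_cnn]. The resulting gate g in (0, 1) decides how much of the
    ViT feature is kept versus the CNN feature at position i.
    (Assumed formulation, not taken from the paper.)
    """
    if len(f_vit) != len(f_cnn):
        raise ValueError("both paths must produce vectors of the same length")
    concat = list(f_vit) + list(f_cnn)
    fused = []
    for i in range(len(f_vit)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(logit)
        fused.append(g * f_vit[i] + (1.0 - g) * f_cnn[i])
    return fused


# Hypothetical toy features from the global (ViT) and local (CNN) paths.
f_vit = [1.0, 0.0]
f_cnn = [0.0, 1.0]
zero_weights = [[0.0] * 4, [0.0] * 4]

# With zero weights and bias the gate is 0.5, so the paths are averaged.
balanced = gated_fusion(f_vit, f_cnn, zero_weights, [0.0, 0.0])

# A large positive bias pushes the gate toward 1, favouring the ViT path.
vit_heavy = gated_fusion(f_vit, f_cnn, zero_weights, [20.0, 20.0])
```

Because the gate is computed from both inputs, a trained version of such a module could lean on local texture for one case and on global context for another, which is the behaviour the abstract refers to as weighting "based on contextual demands".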

Paper Structure

This paper contains 15 sections, 7 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of multi-sequence slices from our bladder cancer dataset and the BraTS dataset (Menze et al., 2014) to illustrate the distinct contrasts across the sequences in each dataset. (a) Consecutive slices from the ADC, T2, and DWI sequences in our bladder cancer dataset; (b) consecutive slices from the FLAIR, T1, and T2 sequences in the BraTS dataset.
  • Figure 2: (a) The overall framework of H-CNN-ViT, which includes three branches with Dual-Path Attention (DPA) blocks for different MRI sequences (ADC, T2, and DWI), an MLP (Multi-Layer Perceptron) Encoder for clinical data, and a Global Gated Attention Module (Global GAM) for integrating the extracted features ($y_{ADC}$, $y_{T2}$, $y_{DWI}$, and $y_{Clinic}$) from each branch. The output from the Global GAM, $Y$, is passed to a classification head for final prediction. (b) The detailed architecture of the MLP Encoder for clinical data. (c) The detailed architecture of the classification head, which generates the final prediction.
  • Figure 3: (a) The detailed structure of the Dual-Path Attention (DPA) block, which comprises two parallel paths: a ViT path and a CNN path. Each path includes a 1 $\times$ 1 convolutional layer for channel transformation and a feature extractor. A Local GAM is applied to fuse the outputs of both paths. (b) The detailed architecture of the CNN extractor, designed to capture localized features.
  • Figure 4: The Local Gated Attention Module for feature fusion within the Dual-Path Attention blocks.
  • Figure 5: The leftmost image shows the original selected ADC slice used as input to the model. The middle image overlays the Class Activation Map (CAM) (Zhou et al., 2016) derived from the third convolutional layer of the CNN extractor for the ADC modality. The rightmost image overlays the CAM derived from the attention mechanism in the second transformer block of the ViT extractor for the ADC modality.
  • ...and 1 more figure
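The data flow in the Figure 2 caption (four branch outputs $y_{ADC}$, $y_{T2}$, $y_{DWI}$, and $y_{Clinic}$ combined by a Global GAM into $Y$, then passed to a classification head) can be sketched as follows. This is an illustrative assumption rather than the published module: the softmax branch weighting, the names `global_gate` and `classify`, and the parameters `score_vectors`, `head_weights`, and `head_bias` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_gate(branch_features, score_vectors):
    """Weight whole branches against each other and sum them into Y.

    branch_features: dict of branch name -> feature vector (same length each).
    score_vectors: dict of branch name -> hypothetical learned vector that
    maps the branch's features to one scalar relevance score.
    """
    names = list(branch_features)
    scores = [
        sum(w * x for w, x in zip(score_vectors[n], branch_features[n]))
        for n in names
    ]
    weights = softmax(scores)
    dim = len(branch_features[names[0]])
    fused = [
        sum(weights[k] * branch_features[n][j] for k, n in enumerate(names))
        for j in range(dim)
    ]
    return fused, dict(zip(names, weights))


def classify(fused, head_weights, head_bias):
    """Toy classification head: one linear layer followed by a sigmoid."""
    logit = sum(w * x for w, x in zip(head_weights, fused)) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical branch outputs standing in for y_ADC, y_T2, y_DWI, y_Clinic.
branches = {
    "ADC": [1.0, 0.0],
    "T2": [0.0, 1.0],
    "DWI": [1.0, 1.0],
    "Clinic": [0.0, 0.0],
}
zero_scores = {name: [0.0, 0.0] for name in branches}

Y, branch_weights = global_gate(branches, zero_scores)
probability = classify(Y, head_weights=[0.0, 0.0], head_bias=0.0)
```

With all score vectors at zero every branch receives weight 0.25, so `Y` is the plain average of the four branch outputs; learned scores would instead shift weight toward whichever sequence or clinical input is most informative for a given patient.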