H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction
Xueyang Li, Zongren Wang, Yuliang Zhang, Zixuan Pan, Yu-Jen Chen, Nishchal Sapkota, Gelei Xu, Danny Z. Chen, Yiyu Shi
TL;DR
This work tackles bladder cancer recurrence prediction using post-operative multi-sequence MRI by introducing a curated, multi-modal dataset and a novel H-CNN-ViT architecture. The model employs Dual-Path Attention blocks that fuse global (ViT) and local (CNN) features within each modality, combined with a two-tier gating mechanism (Local and Global GAM) to balance intra- and inter-branch information across ADC, T2, DWI, and clinical data. On the proposed dataset, H-CNN-ViT achieves an AUC of $78.6\%$, outperforming strong CNN, ViT, and hybrid baselines with statistically significant improvements, and analyses validate the importance of hierarchical gated attention and multi-branch design. The authors also plan to release the dataset and code to facilitate further research in high-dimensional, heterogeneous medical imaging tasks and improve post-operative surveillance in bladder cancer care.
Abstract
Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that selectively weights features from the global (ViT) and local (CNN) paths based on contextual demands, achieving a balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, so that the distinct properties of each imaging channel are captured before integration. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing the CNN, ViT, and hybrid baselines it is compared against. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
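To make the two-tier gating idea concrete, the following is a minimal sketch of how a gate could blend global (ViT-style) and local (CNN-style) features within one branch, and then weight the fused vectors across branches. It is not taken from the released code: the function names `local_gate` and `global_gate`, the scalar sigmoid gate, and the softmax over branch scores are all assumptions chosen for illustration, and the actual Local and Global GAM in H-CNN-ViT may be parameterized differently.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def local_gate(global_feat, local_feat, w_global, w_local, bias=0.0):
    """Hypothetical intra-branch gate.

    Computes one scalar gate g from both feature vectors, then returns
    g * global_feat + (1 - g) * local_feat together with g.
    The weights and bias stand in for learned parameters.
    """
    score = sum(w * f for w, f in zip(w_global, global_feat))
    score += sum(w * f for w, f in zip(w_local, local_feat))
    g = sigmoid(score + bias)
    fused = [g * a + (1.0 - g) * b for a, b in zip(global_feat, local_feat)]
    return fused, g


def global_gate(branch_feats, branch_scores):
    """Hypothetical inter-branch gate.

    Turns one score per branch (e.g. ADC, T2, DWI, clinical) into softmax
    weights and returns the weighted sum of the branch feature vectors.
    """
    top = max(branch_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in branch_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(branch_feats[0])
    combined = [
        sum(w * feat[i] for w, feat in zip(weights, branch_feats))
        for i in range(dim)
    ]
    return combined, weights


# Toy usage: with all-zero gate weights the local gate is exactly 0.5,
# so the fused vector is the element-wise mean of the two paths.
fused, g = local_gate([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0])
combined, weights = global_gate([fused, [4.0, 0.0]], [0.0, 0.0])
```

In this sketch the gate is a single scalar per branch; a per-feature gate (one value per dimension) would be an equally plausible design, and the source text does not say which is used.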
