Bladder Vessel Segmentation using a Hybrid Attention-Convolution Framework
Franziska Krauß, Matthias Ege, Zoltan Lovasz, Albrecht Bartz-Schmidt, Igor Tsaur, Oliver Sawodny, Carina Veil
TL;DR
This paper tackles reliable intraoperative bladder navigation by segmenting visible vascular structures, which serve as patient-specific landmarks despite tissue deformation and artifact-prone cystoscopic imaging. It introduces Hybrid Attention-Convolution (HAC), a two-branch architecture that fuses a Transformer-based backbone, which captures global vessel topology, with a CNN-based refinement branch that preserves thin-vessel details. The Transformer provides a topology-aware prior $P_A$ and the CNN a residual map $P_U$, so that $P_{HAC} = P_A + P_U$. The training workflow combines physics-aware self-supervised pretraining, topology learning on optimized ground-truth targets, and a three-stage supervised refinement, augmented by Self-Supervised Consistency Regularization on unlabeled data. On the BlaVeS dataset, HAC achieves competitive accuracy ($0.94$), high precision ($0.61$), and strong topological consistency (clDice $0.66$), while suppressing false positives from folds and artifacts. The work advances toward a patient-specific bladder digital twin by enabling stable vascular landmark extraction, and it paves the way for frame-to-frame vascular matching for continuous navigation across interventions.
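The additive fusion $P_{HAC} = P_A + P_U$ can be illustrated with a minimal NumPy sketch. Only the sum itself comes from the text; the function names `fuse_hac` and `binarize`, the clipping to $[0, 1]$, the signed residual, the 0.5 threshold, and the toy values are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def fuse_hac(p_a: np.ndarray, p_u: np.ndarray) -> np.ndarray:
    """Fuse a topology prior P_A and a residual map P_U as P_HAC = P_A + P_U."""
    if p_a.shape != p_u.shape:
        raise ValueError("P_A and P_U must have the same shape")
    # Assumption: clip the sum to [0, 1] so it can be read as a probability map.
    return np.clip(p_a + p_u, 0.0, 1.0)


def binarize(p_hac: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn the fused map into a binary vessel mask (threshold is an assumption)."""
    return p_hac >= threshold


# Hypothetical 2x2 example: the residual lowers one uncertain pixel (0.6 -> 0.3)
# and raises another (0.4 -> 0.6), as a refinement branch might do.
p_a = np.array([[0.9, 0.6], [0.1, 0.4]])
p_u = np.array([[0.05, -0.3], [0.0, 0.2]])

p_hac = fuse_hac(p_a, p_u)
mask = binarize(p_hac)
```

In this toy case a negative residual removes a pixel the prior would have accepted, and a positive residual recovers one the prior would have missed, which is the role the text assigns to the refinement branch.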
Abstract
Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable, hollow bladder lacks stable landmarks for orientation. Blood vessels visible during endoscopy offer a patient-specific "vascular fingerprint" for navigation, but automated segmentation is hampered by imperfect endoscopic data: sparse labels, artifacts such as bubbles and variable lighting, continuous deformation, and mucosal folds that mimic vessels. State-of-the-art vessel segmentation methods often fail to address these domain-specific complexities. We introduce a Hybrid Attention-Convolution (HAC) architecture that combines Transformers, which capture a global vessel topology prior, with a CNN that learns a residual refinement map to recover thin-vessel details precisely. To prioritize structural connectivity, the Transformer is trained on optimized ground-truth data that exclude short and terminal branches. Furthermore, to address data scarcity, we employ physics-aware pretraining, a self-supervised strategy that applies clinically grounded augmentations to unlabeled data. Evaluated on the BlaVeS dataset, which consists of endoscopic video frames, our approach achieves high accuracy (0.94) and, compared with state-of-the-art medical segmentation models, superior precision (0.61) and clDice (0.66). Crucially, our method suppresses false positives from mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery. Hence, HAC provides the structural stability required for clinical navigation.
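For readers unfamiliar with the reported metrics, the standard definitions of precision and clDice are given below. The abstract does not spell these out, so the notation is an assumption: $V_P$ and $V_L$ denote the predicted and labeled vessel masks, $S_P$ and $S_L$ their skeletons (centerlines), and $TP$ and $FP$ the true-positive and false-positive pixel counts.

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

$$
T_{prec} = \frac{|S_P \cap V_L|}{|S_P|}, \qquad
T_{sens} = \frac{|S_L \cap V_P|}{|S_L|}, \qquad
\text{clDice} = \frac{2 \, T_{prec} \, T_{sens}}{T_{prec} + T_{sens}}
$$

Because clDice is computed on skeletons rather than on all pixels, it penalizes broken or spurious vessel branches even when they cover few pixels, which is why it is used as a measure of topological consistency.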
