Bladder Vessel Segmentation using a Hybrid Attention-Convolution Framework

Franziska Krauß, Matthias Ege, Zoltan Lovasz, Albrecht Bartz-Schmidt, Igor Tsaur, Oliver Sawodny, Carina Veil

TL;DR

This paper addresses reliable intraoperative bladder navigation by segmenting visible vascular structures, which serve as patient-specific landmarks in deformable, artifact-prone cystoscopic imaging. It introduces Hybrid Attention-Convolution (HAC), a two-branch architecture: a Transformer branch captures global vessel topology and outputs a topology-aware prior $P_A$, while a CNN branch learns a residual map $P_U$ that recovers thin-vessel details. The two outputs are combined additively as $P_{HAC} = P_A + P_U$. The training workflow combines physics-aware self-supervised pretraining, topology learning on optimized ground-truth targets, and a three-stage supervised refinement, augmented by Self-Supervised Consistency Regularization on unlabeled data. On the BlaVeS dataset, HAC reaches an accuracy of $0.94$, a precision of $0.61$, and a clDice of $0.66$ (a measure of topological consistency), while suppressing false positives from folds and artifacts. The work advances toward a patient-specific bladder digital twin by enabling stable vascular landmark extraction, and it paves the way for frame-to-frame vascular matching for continuous navigation across interventions.
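The additive fusion $P_{HAC} = P_A + P_U$ can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `fuse_hac`, the clipping to $[0, 1]$, and the 0.5 threshold are assumptions made for illustration. The summary does not state whether $P_A$ and $P_U$ are logits or probabilities, so the sketch treats them as probability-like maps.

```python
import numpy as np

def fuse_hac(p_a, p_u, threshold=0.5):
    """Additively fuse a topology prior P_A with a residual map P_U.

    Assumption: both maps are probability-like arrays of equal shape.
    Returns the fused map and a binary vessel mask.
    """
    p_a = np.asarray(p_a, dtype=float)
    p_u = np.asarray(p_u, dtype=float)
    if p_a.shape != p_u.shape:
        raise ValueError("P_A and P_U must have the same shape")
    # Additive fusion as stated in the text: P_HAC = P_A + P_U.
    p_hac = p_a + p_u
    # Assumption: keep the fused map in [0, 1] so it can be thresholded.
    p_hac = np.clip(p_hac, 0.0, 1.0)
    # Assumption: a fixed threshold turns the map into a binary mask.
    mask = p_hac >= threshold
    return p_hac, mask

# Toy 2x2 example: the residual raises one weak pixel of the prior
# and lowers one pixel that the prior had marked as vessel.
prior = np.array([[0.9, 0.4], [0.1, 0.6]])
residual = np.array([[0.0, 0.2], [0.0, -0.3]])
fused, vessel_mask = fuse_hac(prior, residual)
print(fused)
print(vessel_mask)
```

In this toy case the residual can act in both directions: it can restore a thin structure the prior missed and it can lower the score of a pixel the prior over-predicted. Whether the residual in HAC is signed in this way is not stated in the summary.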

Abstract

Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable, hollow bladder lacks stable landmarks for orientation. While blood vessels visible during endoscopy offer a patient-specific "vascular fingerprint" for navigation, automated segmentation is challenged by imperfect endoscopic data, including sparse labels, artifacts such as bubbles or variable lighting, continuous deformation, and mucosal folds that mimic vessels. State-of-the-art vessel segmentation methods often fail to address these domain-specific complexities. We introduce a Hybrid Attention-Convolution (HAC) architecture that combines a Transformer, which captures a global vessel-topology prior, with a CNN that learns a residual refinement map to precisely recover thin-vessel details. To prioritize structural connectivity, the Transformer is trained on optimized ground-truth data that exclude short and terminal branches. Furthermore, to address data scarcity, we employ physics-aware pretraining, a self-supervised strategy that applies clinically grounded augmentations to unlabeled data. Evaluated on the BlaVeS dataset, which consists of endoscopic video frames, our approach achieves high accuracy (0.94) as well as superior precision (0.61) and clDice (0.66) compared to state-of-the-art medical segmentation models. Crucially, our method suppresses false positives from mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery. Hence, HAC provides the structural stability required for clinical navigation.
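The abstract reports accuracy and precision side by side, and the two can diverge strongly when vessels occupy only a small fraction of the image. The sketch below is a generic pixel-wise computation on a made-up toy mask, not the authors' evaluation code and not a reproduction of the reported numbers. The function name `pixel_metrics` is hypothetical, and clDice is omitted because it additionally requires skeletonization.

```python
import numpy as np

def pixel_metrics(pred, target):
    """Pixel-wise accuracy and precision for binary segmentation masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))    # vessel predicted, vessel present
    fp = int(np.sum(pred & ~target))   # vessel predicted, background present
    fn = int(np.sum(~pred & target))   # vessel missed
    tn = int(np.sum(~pred & ~target))  # background correctly rejected
    accuracy = (tp + tn) / pred.size
    # Convention chosen here: precision is 0.0 if nothing was predicted.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision}

# Toy 10x10 frame with 5 vessel pixels (5% of the image).
target = np.zeros((10, 10), dtype=bool)
target[0, 0:5] = True

# Prediction: 3 correct vessel pixels plus 2 false positives,
# e.g. a fold mistaken for a vessel.
pred = np.zeros((10, 10), dtype=bool)
pred[0, 0:3] = True
pred[5, 0:2] = True

print(pixel_metrics(pred, target))
```

Here 96 of 100 pixels are classified correctly, yet only 3 of the 5 predicted vessel pixels are true vessels. Accuracy is dominated by the background, which is why precision and a topology-aware score such as clDice are the more informative figures for thin structures.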

Paper Structure (22 sections, 19 equations, 3 figures, 3 tables)


Figures (3)

  • Figure 1: Framework Overview. Blood vessels serve as a patient-specific vascular fingerprint for navigation. We utilize self-supervised physics-aware pre-training to handle cystoscopic challenges (e.g., bubbles, variable lighting) and a Hybrid Attention-Convolution (HAC) architecture to combine global context with fine-scale precision for robust segmentation.
  • Figure 2: Qualitative comparison on challenging cystoscopic conditions using video frames from the BlaVeS test set. Top: Prominent tissue fold. Middle: Severe motion blur. Bottom: Strong vignetting and edge artifacts. Boxed regions highlight difficult areas.
  • Figure 3: Qualitative comparison of vessel segmentation on two consecutive frames from the same cystoscopy. The boxed regions highlight the same anatomical area that must be matched across frames despite challenges such as motion blur, deformation, and a shifting field of view.
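
The captions above name vignetting and variable lighting among the cystoscopic challenges that the physics-aware pretraining is meant to handle. The summary does not describe the actual augmentations, so the following is only a hypothetical illustration of what one such augmentation could look like: a radial brightness falloff applied to an image with values in $[0, 1]$. The function name `apply_vignette`, the quadratic falloff, and the `strength` parameter are all assumptions.

```python
import numpy as np

def apply_vignette(image, strength=0.5):
    """Darken an image toward its corners with a quadratic radial falloff.

    Assumptions: `image` is a float array in [0, 1] with shape (H, W) or
    (H, W, C); `strength` in [0, 1] is the brightness lost at the corners.
    """
    image = np.asarray(image, dtype=float)
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Distance of every pixel from the image center.
    r = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    # Distance from the center to a corner (guarded for 1x1 images).
    r_max = max(np.sqrt(cy ** 2 + cx ** 2), 1e-8)
    # Gain is 1 at the center and (1 - strength) at the corners.
    gain = 1.0 - strength * (r / r_max) ** 2
    if image.ndim == 3:
        gain = gain[..., None]  # broadcast over color channels
    return np.clip(image * gain, 0.0, 1.0)

# Usage on a uniform white 5x5 RGB frame.
frame = np.ones((5, 5, 3))
darkened = apply_vignette(frame, strength=0.5)
print(darkened[2, 2, 0], darkened[0, 0, 0])
```

In a self-supervised setting, the original and the augmented frame would typically be fed to the network with the expectation that the vessel prediction stays the same, since a brightness falloff does not move any vessel. How the paper couples its augmentations to the training objective is not specified in this summary.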