TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy

Yaoyu Liu, Minghui Zhang, Xin You, Hanxiao Zhang, Yun Gu

TL;DR

TubeMLLM is a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. It enhances topology-aware perception by integrating topological priors through explicit natural-language prompting and aligning them with visual representations in a shared-attention architecture.

Abstract

Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. By integrating topological priors through explicit natural-language prompting and aligning them with visual representations in a shared-attention architecture, TubeMLLM significantly enhances topology-aware perception. Furthermore, we construct TubeMData, a pioneering multimodal benchmark comprising comprehensive topology-centric tasks, and introduce an adaptive loss weighting strategy to emphasize topology-critical regions during training. Extensive experiments on fifteen diverse datasets demonstrate the superiority of TubeMLLM. Quantitatively, TubeMLLM achieves state-of-the-art out-of-distribution performance, substantially reducing global topological discrepancies on color fundus photography (decreasing the $\beta_{0}$ number error from 37.42 to 8.58 compared to baselines). Notably, TubeMLLM exhibits exceptional zero-shot cross-modality transfer ability on unseen X-ray angiography, achieving a Dice score of 67.50% while reducing the $\beta_{0}$ number error to 1.21. TubeMLLM also remains robust against degradations such as blur, noise, and low resolution. Finally, in topology-aware understanding tasks, the model achieves 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.
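To make the two reported metrics concrete, the following is a minimal sketch of how a Dice score and a $\beta_{0}$ number error could be computed for a pair of binary masks. It is not the authors' evaluation code. It assumes that the $\beta_{0}$ number error is the absolute difference between the numbers of connected foreground components in the prediction and the ground truth, computed over the whole image with 8-connectivity; the abstract does not state the connectivity or whether the error is computed per patch. The function names `dice_score`, `betti0`, and `betti0_error` are hypothetical.

```python
from collections import deque


def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks given as lists of 0/1 rows."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def betti0(mask):
    """Count 8-connected foreground components (the 0th Betti number)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if not mask[i][j] or seen[i][j]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[i][j] = True
            queue = deque([(i, j)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def betti0_error(pred, target):
    """Assumed definition: absolute difference in connected-component counts."""
    return abs(betti0(pred) - betti0(target))


# Toy example: one missing pixel splits a single vessel into two fragments.
target = [[1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0]]
pred = [[1, 1, 0, 1, 1],
        [0, 0, 0, 0, 0]]

dice = dice_score(pred, target)
err = betti0_error(pred, target)
```

In this toy case the prediction still overlaps most of the vessel, so its Dice score stays high, yet it contains one artificial disconnection. This is the kind of failure that a pixel-overlap metric under-reports and a component-count metric exposes.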

Paper Structure (14 sections, 9 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Unified modeling paradigm of the proposed TubeMLLM. (a) Compared with task-specific I2I models and promptable I2I baselines, TubeMLLM unifies text and image tokens in an MLLM with shared-attention and supports both text and image outputs. (b) TubeMLLM enables better topology, stronger zero-shot cross-modality transfer, and more reliable text guidance. (c.1) Example of topology-aware visual understanding task. (c.2) Example of topology-preserving generation task.
  • Figure 2: Detailed TubeMLLM architecture and TubeMData. (a) TubeMLLM adopts a Mixture-of-Transformers design with coupled generation transformer branch and understanding transformer branch. Adaptive loss weights are derived from error maps. (b) TubeMData CFP training and testing sample distribution across different topology-centric tasks.
  • Figure 3: Topology-centric tasks. Bold text highlights the image modalities and topology priors encoded in language prompts. (a) Topology-refinement generation task. (b) Topology-aware understanding task that selects the mask with better topology.
  • Figure 4: Qualitative results on CFP and XRA OOD test datasets. $\beta_0$Num denotes the $\beta_0$ number error. Regions inside red boxes are highlighted to demonstrate the topological accuracy of TubeMLLM.
  • Figure 5: Qualitative performance on (a) topology refinement, (b) zero-shot transfer to the XRA dataset, and (c) degraded inputs from the CFP datasets. Green text indicates superior results or improved performance.
  • ...and 7 more figures