Multimodal Fusion at Three Tiers: Physics-Driven Data Generation and Vision-Language Guidance for Brain Tumor Segmentation
Mingda Zhang
TL;DR
This work introduces a three-tier fusion architecture for brain tumor segmentation that integrates pixel-level physics-driven data generation, feature-level cross-modal fusion via asynchronous multi-teacher distillation, and semantic-level guidance through CLIP-GPT-4V, bridging data-driven learning with clinical knowledge. Pixel-level generation produces four-modal inputs (MRI, simulated US, synthetic CT) using explicit mappings $Z_{\text{acoustic}} = f_{\text{MRI}\rightarrow Z}(I_{\text{MRI}}; \theta_{\text{mapper}})$ and $p(I_{\text{CT}} \mid I_{\text{MRI}}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(I_{\text{CT}}; \mu_k, \Sigma_k)$. Feature-level fusion employs cosine-based cross-modal alignment $A_{i,j}$ and a dual distillation loss $\mathcal{L}_{\text{distill}}$, integrating MRI, US, CT teachers and a semantic teacher, with FiLM injecting semantic guidance $F_{\text{3D}}$ into the fused features. Semantic-level guidance converts GPT-4V clinical descriptions into spatial priors via CLIP, multi-view fusion $F_{\text{3D}} = \frac{1}{3} \sum_d F_s^d$, cross-modal mapping $F_{\text{combined}}$, and a semantic attention mechanism that enhances ET/TC localization. Validation on BraTS 2020/2021/2023 shows Dice coefficients of 0.867, 0.901, and 0.891, with consistent HD95 reductions, demonstrating substantial gains in boundary localization and small-target segmentation with practical inference times.
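As a rough illustration of three operations named above (the cosine-based alignment $A_{i,j}$, the multi-view average $F_{\text{3D}} = \frac{1}{3} \sum_d F_s^d$, and FiLM modulation), the following NumPy sketch may help. The function names, the tensor shapes, and the choice to pass precomputed per-channel scale and shift vectors are assumptions made for illustration; they are not taken from the paper's implementation, where the FiLM parameters would presumably be predicted from the semantic features by learned layers.

```python
import numpy as np

def cosine_alignment(feats_a, feats_b, eps=1e-8):
    """Pairwise cosine similarity A[i, j] between token i of one modality
    and token j of another. Inputs have shape (N, C) and (M, C)."""
    a = feats_a / (np.linalg.norm(feats_a, axis=1, keepdims=True) + eps)
    b = feats_b / (np.linalg.norm(feats_b, axis=1, keepdims=True) + eps)
    return a @ b.T  # shape (N, M), values in [-1, 1]

def fuse_views(view_feats):
    """Average per-view semantic features; with three views this is
    F_3D = (1/3) * sum_d F_s^d. All views must share one shape."""
    return np.mean(np.stack(view_feats, axis=0), axis=0)

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: per-channel scale and shift.
    features: (C, D, H, W); gamma, beta: (C,) (hypothetical inputs here)."""
    return gamma[:, None, None, None] * features + beta[:, None, None, None]
```

A typical use under these assumptions would compute `cosine_alignment` between flattened MRI and US feature tokens, build `F_3D` with `fuse_views` from three view-wise maps, and then derive `gamma` and `beta` from `F_3D` before calling `film` on the fused features.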
Abstract
Accurate brain tumor segmentation is crucial for neuro-oncology diagnosis and treatment planning. Deep learning methods have made significant progress, but automatic segmentation still faces challenges, including tumor morphological heterogeneity and complex three-dimensional spatial relationships. This paper proposes a three-tier fusion architecture that achieves precise brain tumor segmentation. The method processes information progressively at the pixel, feature, and semantic levels. At the pixel level, physical modeling extends magnetic resonance imaging (MRI) to multimodal data, including simulated ultrasound and synthetic computed tomography (CT). At the feature level, the method performs Transformer-based cross-modal feature fusion through multi-teacher collaborative distillation, integrating three expert teachers (MRI, US, CT). At the semantic level, clinical textual knowledge generated by GPT-4V is transformed into spatial guidance signals using CLIP contrastive learning and Feature-wise Linear Modulation (FiLM). These three tiers together form a complete processing chain from data augmentation to feature extraction to semantic guidance. We validated the method on the Brain Tumor Segmentation (BraTS) 2020, 2021, and 2023 datasets. The model achieves average Dice coefficients of 0.8665, 0.9014, and 0.8912 on the three datasets, respectively, and reduces the 95% Hausdorff Distance (HD95) by an average of 6.57 millimeters compared with the baseline. This method provides a new paradigm for precise tumor segmentation and boundary localization.
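The two reported metrics can be sketched as follows, assuming binary masks of equal shape. This is a minimal NumPy illustration and not the evaluation code used in the paper: the helper names are hypothetical, surface voxels are taken with a face-connected neighborhood, and HD95 is computed as the larger of the two directed 95th-percentile surface distances, which is one of several conventions in use.

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / denom

def surface_points(mask):
    """Coordinates of foreground voxels with at least one
    face-connected background (or out-of-volume) neighbor."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    interior = mask.copy()
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            interior &= np.roll(padded, shift, axis=axis)[core]
    return np.argwhere(mask & ~interior)

def hd95(pred, target, spacing=None):
    """95th-percentile Hausdorff distance between mask surfaces.
    spacing: physical voxel size per axis (e.g. in millimeters)."""
    spacing = np.ones(pred.ndim) if spacing is None else np.asarray(spacing, float)
    p = surface_points(pred) * spacing
    t = surface_points(target) * spacing
    if len(p) == 0 or len(t) == 0:
        raise ValueError("HD95 is undefined when a mask is empty")
    # Full pairwise distance matrix: fine for small masks, memory-heavy for large ones.
    d = np.linalg.norm(p[:, None, :] - t[None, :, :], axis=-1)
    return max(np.percentile(d.min(axis=1), 95), np.percentile(d.min(axis=0), 95))
```

For full-resolution 3D volumes, a distance-transform or KD-tree based implementation would be a more practical choice than the dense pairwise matrix used here.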
