Multimodal Fusion at Three Tiers: Physics-Driven Data Generation and Vision-Language Guidance for Brain Tumor Segmentation

Mingda Zhang

TL;DR

This work introduces a three-tier fusion architecture for brain tumor segmentation that integrates pixel-level physics-driven data generation, feature-level cross-modal fusion via asynchronous multi-teacher distillation, and semantic-level guidance through CLIP-GPT-4V, bridging data-driven learning with clinical knowledge. Pixel-level generation produces four-modal inputs (MRI, simulated US, synthetic CT) using explicit mappings $Z_{\text{acoustic}} = f_{\text{MRI}\rightarrow Z}(I_{\text{MRI}}; \theta_{\text{mapper}})$ and $p(I_{\text{CT}} \mid I_{\text{MRI}}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(I_{\text{CT}}; \mu_k, \Sigma_k)$. Feature-level fusion employs cosine-based cross-modal alignment $A_{i,j}$ and a dual distillation loss $\mathcal{L}_{\text{distill}}$, integrating MRI, US, CT teachers and a semantic teacher, with FiLM injecting semantic guidance $F_{\text{3D}}$ into the fused features. Semantic-level guidance converts GPT-4V clinical descriptions into spatial priors via CLIP, multi-view fusion $F_{\text{3D}} = \frac{1}{3} \sum_d F_s^d$, cross-modal mapping $F_{\text{combined}}$, and a semantic attention mechanism that enhances localization of the enhancing tumor (ET) and tumor core (TC). Validation on BraTS 2020/2021/2023 shows Dice coefficients of 0.867, 0.901, and 0.891, with consistent HD95 reductions, demonstrating substantial gains in boundary localization and small-target segmentation with practical inference times.
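The synthetic-CT mapping above is a Gaussian mixture density. As a minimal sketch of how such a density is evaluated, the snippet below handles the scalar (single-intensity) case only. This summary does not say how $\pi_k$, $\mu_k$, and $\Sigma_k$ depend on the MRI input, so they are passed in as plain numbers; the function names and the example parameters are illustrative assumptions, not the authors' implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, weights, means, variances):
    """Evaluate p(x) = sum_k pi_k * N(x; mu_k, sigma_k^2).

    weights, means, variances are hypothetical per-component parameters;
    in the paper they would be conditioned on the MRI intensity.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Illustrative two-component mixture (made-up numbers, not from the paper).
density = mixture_density(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

With a single component of weight 1, the mixture reduces to an ordinary Gaussian, which is a convenient sanity check for the weights and variances.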

Abstract

Accurate brain tumor segmentation is crucial for neuro-oncology diagnosis and treatment planning. Deep learning methods have made significant progress, but automatic segmentation still faces challenges, including tumor morphological heterogeneity and complex three-dimensional spatial relationships. This paper proposes a three-tier fusion architecture that achieves precise brain tumor segmentation. The method processes information progressively at the pixel, feature, and semantic levels. At the pixel level, physical modeling extends magnetic resonance imaging (MRI) to multimodal data, including simulated ultrasound and synthetic computed tomography (CT). At the feature level, the method performs Transformer-based cross-modal feature fusion through multi-teacher collaborative distillation, integrating three expert teachers (MRI, US, CT). At the semantic level, clinical textual knowledge generated by GPT-4V is transformed into spatial guidance signals using CLIP contrastive learning and Feature-wise Linear Modulation (FiLM). These three tiers together form a complete processing chain from data augmentation to feature extraction to semantic guidance. We validated the method on the Brain Tumor Segmentation (BraTS) 2020, 2021, and 2023 datasets. The model achieves average Dice coefficients of 0.8665, 0.9014, and 0.8912 on the three datasets, respectively, and reduces the 95% Hausdorff Distance (HD95) by an average of 6.57 millimeters compared with the baseline. This method provides a new paradigm for precise tumor segmentation and boundary localization.
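The Dice coefficient reported above measures the overlap between a predicted mask $P$ and a reference mask $T$ as $2|P \cap T| / (|P| + |T|)$. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, and the BraTS evaluation pipeline may handle that edge case differently.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of the foreground sizes of both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Toy example: 1 overlapping voxel, 2 predicted, 1 reference.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

In the toy example the score is 2·1 / (2 + 1) = 2/3, which shows how a single false-positive voxel on a small target lowers Dice sharply.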

Paper Structure

This paper contains 20 sections, 16 equations, and 3 figures.

Figures (3)

  • Figure 1: Overall framework of the three-tier fusion architecture. The architecture processes MRI input through three parallel pathways: $T_{\text{MRI}}$ (original MRI teacher), $T_{\text{US}}$ (simulated ultrasound teacher via DiffUS), and $T_{\text{CT}}$ (synthetic CT teacher via density inference). Feature maps and outputs from the three teachers guide the student network through cross-modal feature fusion. GPT-4V generates clinical descriptions from MRI slices. CLIP contrastive learning transforms these into semantic guidance $F_{\text{3D}}$. The FiLM (Feature-wise Linear Modulation) module modulates student features with semantic information. This enables precise segmentation through multi-teacher distillation ($\mathcal{L}_{\text{resp}}$ and $\mathcal{L}_{\text{feat}}$).
  • Figure 2: Three-dimensional visualization based on ITK-SNAP: grayscale MRI with whole tumor (green), tumor core (yellow), and enhancing tumor (red) masks, displayed in axial, sagittal, and coronal views.
  • Figure 3: Semantic attention visualization. The left column shows the grayscale MRI; the next three columns show attention heatmaps dedicated to the enhancing tumor, tumor core, and whole tumor; the rightmost column shows semantic attention (guided by CLIP and GPT). The second row overlays the heatmaps on the base image, with the leftmost panel showing the weighted fusion attention.
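The Figure 1 caption describes semantic guidance $F_{\text{3D}}$ modulating student features through FiLM. A minimal sketch of the two operations involved is given below: averaging per-view features (the TL;DR gives $F_{\text{3D}} = \frac{1}{3} \sum_d F_s^d$) and applying a per-channel scale and shift. How the paper derives the FiLM parameters from $F_{\text{3D}}$ is not stated in this summary, so `gamma` and `beta` are supplied directly; all names here are hypothetical.

```python
def fuse_views(view_features):
    """Element-wise mean over per-view feature vectors.

    With three views this is F_3D = (1/3) * sum_d F_s^d.
    """
    n_views = len(view_features)
    return [sum(values) / n_views for values in zip(*view_features)]


def film(features, gamma, beta):
    """FiLM: out[c][i] = gamma[c] * features[c][i] + beta[c] for each channel c."""
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Toy example: three views with two feature values each (made-up numbers).
f_3d = fuse_views([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Two channels of student features, modulated by assumed gamma and beta.
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[1.0, 0.0])
```

In a full model, `gamma` and `beta` would typically come from a small learned projection of the conditioning vector, so the semantic signal rescales whole feature channels without altering the spatial layout.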