A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI

Arvind Murari Vepa, Yannan Yu, Jingru Gan, Anthony Cuturrufo, Weikai Li, Wei Wang, Fabien Scalzo, Yizhou Sun

TL;DR

This work addresses visual question answering (VQA) on multiparametric 3D brain MRI (mpMRI) by introducing mpLLM, a prompt-conditioned hierarchical mixture-of-experts (MoE) that fuses interrelated mpMRI modalities through modality-level and token-level routing and passes the fused vision tokens to a language model. To overcome scarce image-text supervision, it uses a clinician-validated synthetic VQA pipeline derived from segmentation annotations, which enables end-to-end fine-tuning without image-report pretraining. mpLLM outperforms strong medical VLM baselines by an average of 5.3% across BraTS-derived mpMRI datasets while using substantially less GPU memory. The paper contributes the first clinically validated 3D mpMRI VQA dataset, a novel multimodal LLM architecture for interdependent 3D modalities, and evidence for the medical utility of hierarchical MoE-based fusion in this domain.

Abstract

We introduce mpLLM, a prompt-conditioned hierarchical mixture-of-experts (MoE) architecture for visual question answering over multi-parametric 3D brain MRI (mpMRI). mpLLM routes across modality-level and token-level projection experts to fuse multiple interrelated 3D modalities, enabling efficient training without image-report pretraining. To address limited image-text paired supervision, mpLLM integrates a synthetic visual question answering (VQA) protocol that generates medically relevant VQA from segmentation annotations, and we collaborate with medical experts for clinical validation. mpLLM outperforms strong medical VLM baselines by 5.3% on average across multiple mpMRI datasets. Our study features three main contributions: (1) the first clinically validated VQA dataset for 3D brain mpMRI, (2) a novel multimodal LLM that handles multiple interrelated 3D modalities, and (3) strong empirical results that demonstrate the medical utility of our methodology. Ablations highlight the importance of modality-level and token-level experts and prompt-conditioned routing.
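
To make the routing idea concrete, the following is a minimal sketch of a prompt-conditioned, two-level MoE projector in plain Python. It is not the authors' implementation: the class name `HierarchicalMoE`, the linear routers, the dense softmax gating, the shared pool of token-level experts, and all dimensions are assumptions chosen for illustration. The four input modalities stand in for the sequences commonly found in BraTS-style mpMRI (T1, contrast-enhanced T1, T2, FLAIR).

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class HierarchicalMoE:
    """Hypothetical two-level MoE projector (illustrative only)."""

    def __init__(self, n_modalities, n_experts, d_vis, d_prompt, d_out, seed=0):
        rng = random.Random(seed)
        # Modality-level router: prompt -> one weight per modality.
        self.modality_router = rand_matrix(n_modalities, d_prompt, rng)
        # Token-level router: [token; prompt] -> one weight per expert.
        self.token_router = rand_matrix(n_experts, d_vis + d_prompt, rng)
        # Projection experts mapping vision features to the LLM width.
        self.experts = [rand_matrix(d_out, d_vis, rng) for _ in range(n_experts)]

    def forward(self, tokens_by_modality, prompt):
        modality_w = softmax(matvec(self.modality_router, prompt))
        n_tokens = len(tokens_by_modality[0])
        d_out = len(self.experts[0])
        fused = [[0.0] * d_out for _ in range(n_tokens)]
        for w_m, tokens in zip(modality_w, tokens_by_modality):
            for i, tok in enumerate(tokens):
                # Gate depends on both the token and the prompt.
                gate = softmax(matvec(self.token_router, tok + prompt))
                for g, expert in zip(gate, self.experts):
                    proj = matvec(expert, tok)
                    for j in range(d_out):
                        fused[i][j] += w_m * g * proj[j]
        return fused, modality_w


# Toy inputs: 4 modalities, 5 vision tokens each, random features.
rng = random.Random(1)
prompt = [rng.gauss(0.0, 1.0) for _ in range(8)]
scans = [[[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(5)]
         for _ in range(4)]
moe = HierarchicalMoE(n_modalities=4, n_experts=3, d_vis=16, d_prompt=8, d_out=32)
fused, modality_w = moe.forward(scans, prompt)
```

In this sketch the language model would receive `fused`, one prompt-conditioned token per spatial position, rather than a separate token sequence per modality. That is one plausible reading of how fusing modalities before the LLM could reduce sequence length and memory use; the paper's actual design may differ.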

Paper Structure

This paper contains 49 sections, 13 equations, 3 figures, 14 tables.

Figures (3)

  • Figure 1: High-level comparison between LLaVA-Med and mpLLM. While LLaVA-Med uses a standard projection layer, our method uses a hierarchical MoE block which ingests both the prompt and imaging to produce prompt-conditioned vision tokens that leverage all the 3D modalities.
  • Figure 2: Detailed overview of our mpLLM pipeline.
  • Figure 3: Heatmap for correlation between high-level expert weight vectors for standard prompts in the GLI dataset. NETC = non-enhancing tumor core, ET = enhancing tissue, SNFH = surrounding FLAIR hyperintensity, RC = resection cavity.
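
A correlation heatmap like the one described for Figure 3 can be computed from one routing-weight vector per prompt. The sketch below is an assumption about the procedure, not the authors' code: the helper names `pearson` and `correlation_matrix` are hypothetical, and the weight values are placeholders invented for illustration, not results from the paper.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(vectors):
    """Pairwise correlations between named weight vectors."""
    names = list(vectors)
    matrix = [[pearson(vectors[r], vectors[c]) for c in names] for r in names]
    return names, matrix


# Placeholder modality-level weights per prompt type (made-up values).
weights = {
    "NETC": [0.40, 0.30, 0.20, 0.10],
    "ET": [0.10, 0.60, 0.20, 0.10],
    "SNFH": [0.10, 0.10, 0.30, 0.50],
    "RC": [0.35, 0.35, 0.20, 0.10],
}
names, corr = correlation_matrix(weights)

for name, row in zip(names, corr):
    print(name.ljust(5), " ".join(f"{value:+.2f}" for value in row))
```

Each row of `corr` corresponds to one prompt type. High off-diagonal values would indicate that two prompt types lead the router to weight the modality-level experts similarly.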