VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

Vishwesh Nath, Wenqi Li, Dong Yang, Andriy Myronenko, Mingxin Zheng, Yao Lu, Zhijian Liu, Hongxu Yin, Yucheng Tang, Pengfei Guo, Can Zhao, Ziyue Xu, Yufan He, Greg Heinrich, Yee Man Law, Benjamin Simon, Stephanie Harmon, Stephen Aylward, Marc Edgar, Michael Zephyr, Song Han, Pavlo Molchanov, Baris Turkbey, Holger Roth, Daguang Xu

TL;DR

VILA-M3 introduces a medical vision-language framework that injects domain-expert model knowledge into expert-guided instruction fine-tuning, enabling accurate VQA, classification, segmentation via expert models, and radiology report generation. By leveraging 2D/3D fusion and dynamic expert triggering, the approach achieves state-of-the-art performance across multiple medical benchmarks while maintaining generalist capabilities. Ablation studies confirm the value of expert-guided IFT, data balancing, and robust training dynamics, with statistical evidence showing superiority over GPT-4o baselines in several settings. The work demonstrates the practical viability of integrating specialist models into medical VLMs and outlines future directions including retrieval-augmented generation and multi-agent expert coordination for broader clinical tasks.

Abstract

Generalist vision language models (VLMs) have made significant strides in computer vision, but they fall short in specialized fields like healthcare, where expert knowledge is essential. In traditional computer vision tasks, creative or approximate answers may be acceptable, but in healthcare, precision is paramount. Current large multimodal models like Gemini and GPT-4o are insufficient for medical tasks due to their reliance on memorized internet knowledge rather than the nuanced expertise required in healthcare. VLMs are usually trained in three stages: vision pre-training, vision-language pre-training, and instruction fine-tuning (IFT). IFT has typically been applied using a mixture of generic and healthcare data. In contrast, we propose that for medical VLMs, a fourth stage of specialized IFT is necessary, which focuses on medical data and includes information from domain expert models. Domain expert models developed for medical use are crucial because they are specifically trained for certain clinical tasks, e.g., to detect tumors and classify abnormalities through segmentation and classification. In doing so, they learn fine-grained features of medical data that are often too intricate for a VLM to capture effectively, especially in radiology. This paper introduces a new framework, VILA-M3, for medical VLMs that utilizes domain knowledge via expert models. Through our experiments, we show improved state-of-the-art (SOTA) performance with an average improvement of ~9% over the prior SOTA model Med-Gemini and ~6% over models trained on the specific tasks. Our approach emphasizes the importance of domain expertise in creating precise, reliable VLMs for medical applications.

Paper Structure

This paper contains 34 sections, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Left: Comparison of VILA-M3 with SOTA benchmarks such as Med-Gemini and task-specific SOTA models. VILA-M3-40B performance is shown in comparison. VILA-M3 achieves better performance on all datasets, indicating that it generalizes well. Right: VILA-M3 architecture overview. The model uses a projection layer to align visual features with textual user prompts and with model cards describing the available "expert models".
  • Figure 2: VILA-M3 possesses the capability to support a diverse range of tasks, including visual question answering, classification, and report generation. Segmentation tasks are performed by suitable "expert models", such as the BraTS brain tumor segmentation model for multimodal MRI.
  • Figure 3: Feedback of segmentation results can improve the quality of responses received from VLMs. This observation holds true for both VILA-M3 and GPT-4o. The models without expert segmentation fail to detect the tumor, unlike the models with access to expert model segmentation. The blue annotation box shows the marked tumor location. Traditional VLMs cannot capture such fine features unless guided by expert outcomes.
  • Figure 4: The heatmap shows the performance of the 8B model on all datasets after training for 1, 2, and 3 epochs. Model performance degrades for the model trained for 3 epochs.
  • Figure 5: Comparison of VILA-M3 training with balanced and unbalanced healthcare datasets. The comparison is shown for the 3B model, trained for two epochs in each setting.
  • ...and 6 more figures