VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Lingxiao Luo, Bingda Tang, Xuanzhong Chen, Rong Han, Ting Chen

TL;DR

VividMed addresses the need for versatile visual grounding in medical vision-language models by augmenting a base VLM with a promptable localization module that can produce both segmentation masks and instance-level bounding boxes for 2D and 3D medical data. It uses a three-stage training pipeline and an automatic data synthesis process, both built on open datasets and models, to enable grounded report generation alongside VQA and captioning. Empirical results show that incorporating grounding improves performance across downstream tasks and that grounded reports align with clinical evaluation metrics; ablations confirm the beneficial role of grounding. The work advances grounded medical AI and provides a foundation for broader clinical applications, though generalization beyond chest imaging and open-data constraints remain areas for future work.

Abstract

Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance-level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data. We design a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models. Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at https://github.com/function2-llx/MMMM.

Paper Structure

This paper contains 63 sections, 3 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: The architecture of VividMed, which is built upon a base VLM (left and lower) and a promptable localization module (upper right). The model marks key phrases for grounding by enclosing them in bracket tokens, and the hidden state of the closing bracket token is used to prompt the localization module. The query tokens for both masks and instances are fed to the transformer-based localization decoder in parallel. The bounding boxes for negative instances are illustrated with dashed lines. The model accepts both 2D and 3D images as input by adaptively adjusting the weights in the patch embedding layer. The vision encoder of the localization module is omitted for clarity.
  • Figure 2: Selected qualitative results for grounded report generation (zoom in for a better view). Impressions are omitted for clarity.
  • Figure 3: In this example, the model wrongly identifies cardiomegaly and produces an unusual visual grounding result, which may alert the radiologist to the error in clinical practice.
  • Figure 4: In this example, the model correctly identifies cardiomegaly and atelectasis, as validated by the corresponding bounding boxes output by the visual grounding. However, it omits the opacity that is present.
  • Figure 5: In this example, the model correctly reports that no abnormality is present.
  • ...and 7 more figures
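
The prompting mechanism described in the Figure 1 caption (key phrases enclosed in bracket tokens, with the hidden state at the closing bracket used to prompt the localization module) can be illustrated with a minimal sketch. This is not the authors' implementation: the token strings `<p>` and `</p>`, the function names, and the toy hidden states are all assumptions made for illustration, and a real model would operate on tensors rather than Python lists.

```python
from typing import List, Sequence, Tuple

# Hypothetical bracket tokens; the actual token strings used by VividMed
# are not given in this text.
OPEN, CLOSE = "<p>", "</p>"


def find_grounded_phrases(tokens: Sequence[str]) -> List[Tuple[str, int]]:
    """Return (phrase, index of the closing bracket token) for each bracketed span."""
    phrases = []
    start = None
    for i, tok in enumerate(tokens):
        if tok == OPEN:
            start = i + 1  # phrase begins after the opening bracket
        elif tok == CLOSE and start is not None:
            phrases.append((" ".join(tokens[start:i]), i))
            start = None
    return phrases


def gather_prompts(
    hidden_states: Sequence[Sequence[float]], tokens: Sequence[str]
) -> List[Tuple[str, Sequence[float]]]:
    """Pair each grounded phrase with the hidden state at its closing bracket.

    In the described architecture, such a vector would serve as the prompt
    for the localization module; here it is simply returned.
    """
    return [
        (phrase, hidden_states[idx])
        for phrase, idx in find_grounded_phrases(tokens)
    ]


# Toy generated sequence with two bracketed key phrases.
tokens = [
    "The", OPEN, "heart", CLOSE, "is", "enlarged", "with",
    OPEN, "left", "pleural", "effusion", CLOSE, ".",
]
# Toy per-token hidden states (dimension 2), standing in for the VLM's output.
hidden_states = [[float(i), float(i) * 0.5] for i in range(len(tokens))]

prompts = gather_prompts(hidden_states, tokens)
for phrase, vector in prompts:
    print(phrase, vector)
```

Because the closing bracket follows the whole phrase, its hidden state can attend to every token of the phrase in a causal model, which is one plausible reason for choosing that position as the prompt.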