Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset

Ziye Deng, Ruihan He, Jiaxiang Liu, Yuan Wang, Zijie Meng, Songtao Jiang, Yong Xie, Zuozhu Liu

TL;DR

The paper tackles data scarcity and modality heterogeneity in medical image grounding by introducing Med-GLIP-5M, a large-scale, seven-modality dataset with 5.3 million region-level annotations and hierarchical region labels. It proposes Med-GLIP, a modality-aware grounding framework that learns hierarchical spatial semantics through prompts and modality-specific image encoders paired with a shared language encoder, enabling effective zero-shot and fine-tuned grounding across modalities. The approach yields superior grounding accuracy across modalities and translates into meaningful gains for downstream medical VQA and report generation tasks, demonstrating strong generalization and practical impact for clinical multimodal reasoning. The dataset and framework collectively advance scalable, spatially grounded medical vision-language models and set the stage for broader integration with large language models in clinical contexts.

Abstract

Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and automated report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset Med-GLIP-5M comprising over 5.3 million region-level annotations across seven imaging modalities, covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data -- enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.
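To make the idea of hierarchical region labels concrete, the following is a minimal sketch of how a phrase-to-region annotation with organ-level and lesion-level boxes might be represented, together with the standard intersection-over-union (IoU) measure commonly used to score a predicted box against a reference box. The `Region` class, its field names, the `flatten` helper, and the example coordinates are all hypothetical illustrations; they are not the released Med-GLIP-5M schema, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Region:
    """Hypothetical region-level annotation: a phrase paired with a box."""
    phrase: str
    box: Box
    # Finer-grained regions nested inside this one (e.g. lesions in an organ).
    children: List["Region"] = field(default_factory=list)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def flatten(region: Region, depth: int = 0) -> Iterator[Tuple[int, str, Box]]:
    """Yield (depth, phrase, box) for a region and all of its descendants."""
    yield depth, region.phrase, region.box
    for child in region.children:
        yield from flatten(child, depth + 1)


# Made-up example: an organ-level box containing a lesion-level box.
lung = Region(
    "left lung",
    (100.0, 80.0, 300.0, 400.0),
    children=[Region("pneumonia lesion", (150.0, 200.0, 230.0, 280.0))],
)

for depth, phrase, box in flatten(lung):
    print("  " * depth + phrase, box)
```

Nesting the lesion under the organ keeps both granularities attached to the same image, so a grounding model can be queried with either phrase and scored with `iou` against the matching box.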

Paper Structure

This paper contains 11 sections, 4 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Enhance VQA and MRG with Med-GLIP.
  • Figure 2: Med-GLIP-5M has 7 modality categories in total, with multiple organs containing suborgans in the dataset.
  • Figure 3: Illustration of hierarchical region-level annotations across modalities. Each subfigure (a–f) shows green bounding boxes and textual descriptions over CT, X-ray, ultrasound, endoscopy, and MRI images. Multi-level boxes reflect hierarchical semantics, providing fine-grained region-text supervision for structured medical grounding.
  • Figure 4: Performance comparison with and without Med-GLIP on the MRG task, using the R2Gen and MLRG baselines.
  • Figure 5: Performance comparison with and without Med-GLIP on Med-VQA, using the VQA-RAD, SLAKE, and PathVQA datasets.