Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset
Ziye Deng, Ruihan He, Jiaxiang Liu, Yuan Wang, Zijie Meng, Songtao Jiang, Yong Xie, Zuozhu Liu
TL;DR
The paper tackles data scarcity and modality heterogeneity in medical image grounding by introducing Med-GLIP-5M, a large-scale dataset spanning seven imaging modalities with over 5.3 million region-level annotations and hierarchical region labels. It proposes Med-GLIP, a modality-aware grounding framework that learns hierarchical spatial semantics through prompts and modality-specific image encoders paired with a shared language encoder, enabling zero-shot and fine-tuned grounding across modalities. The approach outperforms state-of-the-art baselines across multiple grounding benchmarks, and integrating its spatial outputs into downstream medical VQA and report generation yields substantial gains, demonstrating generalization and practical value for clinical multimodal reasoning. Together, the dataset and framework advance scalable, spatially grounded medical vision-language models and set the stage for broader integration with large language models in clinical contexts.
Abstract
Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and medical report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset, Med-GLIP-5M, comprising over 5.3 million region-level annotations across seven imaging modalities and covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Building on this dataset, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data, enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.
