GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He

TL;DR

To address the scarcity of domain-specific data that limits general medical AI, the paper introduces GMAI-VL-5.5M, a large multimodal dataset created by annotation-guided conversion of 219 datasets into image-text pairs. Built on this dataset, GMAI-VL uses a three-stage training pipeline (shallow alignment, deep alignment, instruction tuning) to tightly fuse visual and linguistic medical information. The model achieves state-of-the-art performance on the OmniMedVQA, GMAI-MMBench, and MMMU Health & Medicine benchmarks, demonstrating strong multimodal diagnosis and reasoning capabilities. The work also emphasizes data quality, traceability, and multilingual support to improve generalization across diverse clinical settings.

Abstract

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we formulate GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.

Paper Structure

This paper contains 27 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Overview of GMAI-VL and GMAI-VL-5.5M. (a) Sources, departments, modalities, task types, and instruction formats of GMAI-VL-5.5M. (b) Architecture of GMAI-VL, with a Vision Encoder, Projector, and Large Language Model. (c) Three-stage training process, including shallow alignment, deep alignment, and instruction tuning, with corresponding data sizes and components. Icons in the figure mark which components are trained and which are frozen in each stage.
  • Figure 2: Example results of our GMAI-VL model. Figure (e) is a failed case.
  • Figure 4: Distribution of the training dataset. The inner ring represents major categories, each shown in a distinct color, while the outer ring depicts the corresponding subcategories. Segment sizes are proportional to data volume, as indicated in the legend, which also provides the data volume for each subcategory.
  • Figure : (a) Modality distribution
  • Figure : (a) Modality distribution
  • ...and 3 more figures
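
The three-stage process in Figure 1(c) can be pictured as a schedule that decides, per stage, which of the three components from Figure 1(b) receive gradient updates. The sketch below is illustrative only. The trained/frozen icons did not survive on this page, so the per-stage trainable sets are an assumption (a common pattern for this kind of pipeline), not something stated here. The names `Stage`, `STAGES`, `frozen_components`, and `schedule` are hypothetical.

```python
from dataclasses import dataclass

# Component names follow Figure 1(b): Vision Encoder, Projector, Large Language Model.
COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components updated during this stage


# ASSUMPTION: which components are trained in each stage is NOT given on this
# page. The sets below are one plausible schedule, for illustration only.
STAGES = (
    Stage("shallow_alignment", frozenset({"projector"})),
    Stage("deep_alignment", frozenset({"vision_encoder", "projector"})),
    Stage("instruction_tuning", frozenset(COMPONENTS)),
)


def frozen_components(stage):
    """Return the components kept frozen in a stage, in COMPONENTS order."""
    return [c for c in COMPONENTS if c not in stage.trainable]


def schedule():
    """Map each stage name to a {component: is_trainable} dictionary."""
    return {
        s.name: {c: (c in s.trainable) for c in COMPONENTS}
        for s in STAGES
    }


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, "frozen:", frozen_components(stage))
```

In a real training loop, each flag would typically control whether the corresponding module's parameters receive gradients. Keeping the schedule as plain data makes it easy to swap in the actual per-stage settings from the paper.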