Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs

Yan Shu, Chi Liu, Robin Chen, Derek Li, Bryan Dai

TL;DR

Fleming-VL introduces a unified medical visual-language model capable of handling 2D images, 3D volumetric data, and temporal videos in a single end-to-end framework. It achieves this through a data-centric pipeline combining interleaved pretraining, targeted data augmentation for underrepresented modalities, extensive instruction tuning, and GRPO-based reinforcement learning, all built on a Vision-Projection-Language backbone using InternViT and V2PE. The authors demonstrate state-of-the-art performance across 9 medical VQA, 3D, and video benchmarks, with notable improvements in radiology report generation and temporal understanding, while balancing cross-modal capabilities and reducing modality bias. They publicly release model weights, training data, and evaluation protocols to promote transparent, reproducible progress toward practical, safety-conscious medical AI systems.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges due to its heterogeneous nature -- encompassing diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical-specific domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; (3) extending existing evaluation frameworks to incorporate 3D volumetric and video understanding benchmarks. Through supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL in multiple model scales. Extensive experiments demonstrate that Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding. We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.

Paper Structure

This paper contains 22 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Performance on medical multimodal understanding benchmarks. "HVU", "MVQA", and "3D-MVQA" denote holistic video understanding, medical video question answering, and 3D-medical video question answering, respectively.
  • Figure 2: Fleming-VL data curation pipeline.
  • Figure 3: Fleming-VL data curation pipeline.
  • Figure 4: Visualization of Fleming-VL, in which models can reason across different modalities.