Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs

Yan Shu; Chi Liu; Robin Chen; Derek Li; Bryan Dai

TL;DR

Fleming-VL introduces a unified medical vision-language model that handles 2D images, 3D volumetric data, and video in a single end-to-end framework. It achieves this through a data-centric pipeline combining interleaved pretraining, targeted data augmentation for underrepresented modalities, extensive instruction tuning, and GRPO-based reinforcement learning, all built on a Vision-Projection-Language backbone using InternViT and V2PE. The authors report state-of-the-art performance across nine medical VQA, 3D, and video benchmarks, with notable improvements in radiology report generation and temporal understanding, while balancing cross-modal capabilities and reducing modality bias. They publicly release model weights, training data, and evaluation protocols to promote transparent, reproducible progress toward practical, safety-conscious medical AI systems.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges because it is heterogeneous, spanning diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical-specific domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; (3) extending existing evaluation frameworks to incorporate 3D volumetric and video understanding benchmarks. Through supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL at multiple model scales. Extensive experiments demonstrate that Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding. We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.
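
As a rough illustration of the "group relative" idea behind GRPO mentioned above, the sketch below normalizes the rewards of several answers sampled for the same prompt against that group's own mean and standard deviation. It shows only the generic advantage computation; the reward values, the function name `group_relative_advantages`, the `eps` constant, and the use of population standard deviation are assumptions made for illustration, not details taken from Fleming-VL's training setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer relative to its own group.

    rewards: scalar rewards for answers sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some variants use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one medical VQA prompt
# (1.0 = judged correct, 0.0 = judged incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that beat their group's average receive a positive advantage and the rest a negative one, so no separately trained value model is needed to provide a baseline.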
