OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding

Songtao Jiang, Yuan Wang, Sibo Song, Yan Zhang, Zijie Meng, Bohan Lei, Jian Wu, Jimeng Sun, Zuozhu Liu

TL;DR

OmniV-Med introduces a unified medical vision-language model capable of processing text, 2D/3D images, and videos using a rotary position-adaptive encoder and a medical-aware token pruning strategy that reduces visual tokens by about 60%. It is trained in three stages on a comprehensive dataset, OmniV-Med-Instruct, containing 252K instruction-following samples across 14 modalities and 11 clinical tasks, enabling emergent cross-modal alignment. The model achieves state-of-the-art results across seven medical benchmarks (2D, 3D, and video) with both a large 7B variant (OmniV-Med-7B) and a lightweight 1.5B variant that trains on eight RTX3090 GPUs, while enabling efficient long-video inference. This work demonstrates the viability of a scalable, unified Med-VLM framework and lays the groundwork for practical deployment in clinical workflows, supported by data and code release.

Abstract

The practical deployment of medical vision-language models (Med-VLMs) necessitates seamless integration of textual data with diverse visual modalities, including 2D/3D images and videos, yet existing models typically employ separate encoders for different modalities. To address this limitation, we present OmniV-Med, a unified framework for multimodal medical understanding. Our technical contributions are threefold: First, we construct OmniV-Med-Instruct, a comprehensive multimodal medical dataset containing 252K instructional samples spanning 14 medical image modalities and 11 clinical tasks. Second, we devise a rotary position-adaptive encoder that processes multi-resolution 2D/3D images and videos within a unified architecture, diverging from conventional modality-specific encoders. Third, we introduce a medical-aware token pruning mechanism that exploits spatial-temporal redundancy in volumetric data (e.g., consecutive CT slices) and medical videos, effectively reducing 60% of visual tokens without performance degradation. Empirical evaluations demonstrate that OmniV-Med-7B achieves state-of-the-art performance on 7 benchmarks spanning 2D/3D medical imaging and video understanding tasks. Notably, our lightweight variant (OmniV-Med-1.5B) attains comparable performance while requiring only 8 RTX3090 GPUs for training and supporting efficient long-video inference. Data, code and model will be released.
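
To make the idea of pruning redundant tokens across consecutive slices or frames concrete, here is a minimal sketch. The abstract does not state the pruning criterion, so everything specific here is an assumption for illustration: the cosine-similarity test, the `threshold` value, the position-wise comparison against the last kept token, and the name `prune_redundant_tokens`. It should not be read as the authors' method.

```python
import math
from typing import List, Tuple

Token = List[float]   # one visual token embedding
Frame = List[Token]   # tokens of one CT slice or video frame, in a fixed spatial order


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant_tokens(
    frames: List[Frame], threshold: float = 0.95
) -> List[Tuple[int, int, Token]]:
    """Keep a token only if it differs enough from the last kept token
    at the same spatial position (hypothetical criterion).

    Returns (frame_index, position, token) triples for the kept tokens.
    """
    kept: List[Tuple[int, int, Token]] = []
    reference: List[Token] = []  # last kept token per spatial position
    for f_idx, frame in enumerate(frames):
        if not reference:
            # The first frame is kept in full and becomes the reference.
            reference = [list(tok) for tok in frame]
            kept.extend((f_idx, p, tok) for p, tok in enumerate(frame))
            continue
        for p, tok in enumerate(frame):
            if cosine(tok, reference[p]) < threshold:
                kept.append((f_idx, p, tok))
                reference[p] = list(tok)  # update reference on change
    return kept


# Toy volume: 3 slices, 2 token positions, 2-dimensional embeddings.
toy_frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],   # position 1 changes, position 0 is redundant
    [[1.0, 0.01], [1.0, 0.0]],  # both positions are near-duplicates
]
toy_kept = prune_redundant_tokens(toy_frames)
toy_total = sum(len(frame) for frame in toy_frames)
print(f"kept {len(toy_kept)} of {toy_total} tokens")
```

In this toy input half of the tokens are dropped. The roughly 60% reduction reported in the abstract is an empirical result on medical data and cannot be derived from a sketch like this.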

Paper Structure

This paper contains 11 sections, 2 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Comparison with state-of-the-art methods on medical 2D/3D image and video benchmarks.
  • Figure 2: Data distribution, coverage and examples of OmniV-Med-Instruct.
  • Figure 3: The OmniV-Med framework consists of three key components: (a) a medical rotary position-adaptive encoder supporting multiple modalities at different resolutions, (b) medical-aware token reduction to efficiently handle redundant frames and slices in videos and 3D images, and (c) the architecture of our OmniV-Med model.
  • Figure 4: A framework for high-quality medical video captioning via rejection sampling: Utilizing Qwen2.5-VL-72B for candidate generation and combining Qwen2.5-VL-72B with HuatuoGPT2-13B for medical caption evaluation based on relevance, fluency, and accuracy (5-point scale).
  • Figure 5: Detailed comparison with additional metrics in open settings. EMS denotes Exact Match Score.
  • ...and 5 more figures
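
The rejection-sampling loop described in the Figure 4 caption (generate candidate captions, score them on relevance, fluency, and accuracy on a 5-point scale, keep only good ones) can be sketched as follows. This is a sketch under stated assumptions, not the authors' pipeline: the `generate` and `judges` callables stand in for the models named in the caption and are hypothetical, as are the number of candidates, the averaging of scores, and the `min_score` acceptance threshold.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

CRITERIA = ("relevance", "fluency", "accuracy")  # criteria named in the caption
Scores = Dict[str, float]                        # criterion -> score on a 1-5 scale


def rejection_sample_caption(
    generate: Callable[[], str],
    judges: List[Callable[[str], Scores]],
    num_candidates: int = 4,
    min_score: float = 4.0,
) -> Optional[str]:
    """Draw candidates, score each with every judge, and return the best
    candidate if its averaged score reaches min_score, else None.

    Averaging across criteria and judges is an assumption made here.
    """
    best_caption: Optional[str] = None
    best_score = float("-inf")
    for _ in range(num_candidates):
        caption = generate()
        per_judge = [
            mean([judge(caption)[c] for c in CRITERIA]) for judge in judges
        ]
        score = mean(per_judge)
        if score > best_score:
            best_caption, best_score = caption, score
    # Reject the whole batch if even the best candidate is below threshold.
    return best_caption if best_score >= min_score else None


def make_stub_generator(captions: List[str]) -> Callable[[], str]:
    """Hypothetical stand-in for a captioning model: replays fixed strings."""
    it = iter(captions)
    return lambda: next(it)


def stub_judge(caption: str) -> Scores:
    """Hypothetical stand-in for a judge model with hard-coded scores."""
    good = caption == "laparoscopic view of the gallbladder"
    return {
        "relevance": 5.0 if good else 3.0,
        "fluency": 4.0,
        "accuracy": 5.0 if good else 2.0,
    }


stub_captions = ["a video", "laparoscopic view of the gallbladder", "surgery"]
accepted = rejection_sample_caption(
    make_stub_generator(stub_captions), [stub_judge, stub_judge], num_candidates=3
)
print(accepted)
```

Returning `None` when no candidate clears the threshold lets a caller drop the sample or resample, which is the "rejection" part of the scheme.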