MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities

Tooba Tehreem Sheikh, Jean Lahoud, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal

TL;DR

MedROV introduces real-time open-vocabulary detection for medical imaging by adapting YOLO-World to the medical domain and leveraging BioMedCLIP for cross-modal alignment. It builds Omnis, a 600K-sample, multi-modality detection dataset, and employs pseudo-labeling to address missing annotations across datasets. Through integration with a medical foundation model and targeted training strategies, MedROV achieves substantial open-vocabulary gains, outperforming prior medical detectors and natural-image OVOD baselines while maintaining real-time performance (approximately 70–72 FPS). The work demonstrates strong zero-shot and cross-modality generalization, offering a practical, scalable solution for detecting both known and novel medical structures across diverse imaging modalities.

Abstract

Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
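The abstract's reference to cross-modal representations can be illustrated with a small sketch of how an open-vocabulary detector might assign a label to a detected region: embed the region and each text prompt, then pick the prompt with the highest cosine similarity. This is a minimal illustration under assumptions, not MedROV's implementation. The function names (`l2_normalize`, `cosine_scores`, `best_label`) and the three-dimensional toy vectors are hypothetical; in practice the embeddings would come from image and text encoders such as BioMedCLIP.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(region_embedding, text_embeddings):
    """Return {label: cosine similarity} between one region and every text prompt."""
    region = l2_normalize(region_embedding)
    scores = {}
    for label, embedding in text_embeddings.items():
        text = l2_normalize(embedding)
        scores[label] = sum(r * t for r, t in zip(region, text))
    return scores


def best_label(region_embedding, text_embeddings):
    """Pick the text prompt whose embedding is closest to the region embedding."""
    scores = cosine_scores(region_embedding, text_embeddings)
    label = max(scores, key=scores.get)
    return label, scores[label]


# Toy vectors standing in for encoder outputs (hypothetical values).
prompts = {
    "liver": [1.0, 0.0, 0.0],
    "kidney": [0.0, 1.0, 0.0],
    "liver lesion": [0.7, 0.7, 0.0],
}
region = [0.9, 0.1, 0.0]

label, score = best_label(region, prompts)
print(label, round(score, 3))
```

Because the label set is just a dictionary of prompt embeddings, a new class can be queried at inference by adding another prompt, with no retraining of the scoring step. That is the property the open-vocabulary setting relies on.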

Paper Structure

This paper contains 12 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of Traditional Object Detection and MedROV: Traditional detectors are trained on a fixed set of categories and cannot recognize unseen classes. For example, the model shown here detects the left and right lungs but fails to detect pneumonia, which is present in the image. In contrast, MedROV is a Real-time Open Vocabulary detection model for medical imaging that leverages the BioMedCLIP foundation model to enable detection of both seen and unseen classes. At inference, it can detect any class described by a text prompt, if present in the image.
  • Figure 2: An overview of the Omnis 600K dataset. We curate a large-scale object detection dataset for medical imaging by incorporating 35 datasets with diverse modalities (represented in different colors), anatomical regions (displayed at the top), and target areas (indicated within the bars).
  • Figure 3: Overall architecture of MedROV. The model takes image and text labels as input. During training, positive and negative labels are used, whereas during testing, free-form text labels can be passed. The YOLO backbone extracts image features, while the BioMedCLIP text encoder generates text embeddings. These features are fused using the Vision-Language PAN (VL-PAN) to obtain bounding box predictions. During training (outer box), predictions are first filtered based on an IoU threshold. If a predicted class is missing in the dataset but can exist, the high-confidence bounding box is added to the ground truth as a pseudo-label. Otherwise, the cropped region is passed through the BioMedCLIP image encoder for feature extraction. The extracted features replace one of the negative text label embeddings, updating the ground truth. The updated text embeddings and bounding boxes are passed through VL-PAN again to generate refined predictions. Finally, the loss is computed between the new ground truth and the updated predictions.
  • Figure 4: Visual comparison of MedROV’s zero-shot detection performance on four datasets: LiTS, Breast Lesion, BTCV, and MSD Liver. The model successfully detects both known and novel classes, including liver lesion, breast lesion, and liver cancer.
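The pseudo-labeling branch described in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration, not the authors' code. It assumes the IoU filter discards predictions that overlap an existing ground-truth box, and that a class which "can exist" is given as an explicit set. The names `iou`, `add_pseudo_labels`, `annotated_classes`, and `plausible_classes`, and both threshold values, are hypothetical. The caption's other branch (passing the cropped region through the BioMedCLIP image encoder to replace a negative text embedding) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def add_pseudo_labels(predictions, ground_truth, annotated_classes,
                      plausible_classes, iou_thresh=0.5, conf_thresh=0.8):
    """Return ground truth extended with confident predictions of unannotated classes.

    Threshold defaults are placeholders chosen for illustration only.
    """
    updated = list(ground_truth)
    for pred in predictions:
        if pred["label"] in annotated_classes:
            continue  # class is already annotated in this source dataset
        if pred["label"] not in plausible_classes:
            continue  # class cannot appear in this kind of image
        if pred["score"] < conf_thresh:
            continue  # keep only high-confidence boxes
        if any(iou(pred["box"], gt["box"]) >= iou_thresh for gt in updated):
            continue  # assumed filter: skip boxes overlapping existing labels
        updated.append({"box": pred["box"], "label": pred["label"], "pseudo": True})
    return updated


# Toy example: a chest X-ray dataset that annotates lungs but not pneumonia.
ground_truth = [{"box": (10, 10, 50, 50), "label": "left lung"}]
predictions = [
    {"box": (60, 60, 100, 100), "label": "pneumonia", "score": 0.90},
    {"box": (12, 12, 50, 50), "label": "pneumonia", "score": 0.95},
    {"box": (0, 0, 5, 5), "label": "pneumonia", "score": 0.30},
]
updated = add_pseudo_labels(predictions, ground_truth,
                            annotated_classes={"left lung", "right lung"},
                            plausible_classes={"pneumonia"})
print(len(updated), updated[-1]["label"])
```

In this toy run only the first prediction is added: the second overlaps the annotated lung box and the third falls below the confidence threshold. The intent is that unannotated but plausible findings stop being treated as background when datasets with different label sets are merged.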