MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
Tooba Tehreem Sheikh, Jean Lahoud, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal
TL;DR
MedROV introduces real-time open-vocabulary detection for medical imaging by adapting YOLO-World to the medical domain and leveraging BioMedCLIP for cross-modal alignment. It builds Omnis, a 600K-sample, multi-modality detection dataset, and employs pseudo-labeling to address missing annotations across datasets. Through integration with a medical foundation model and targeted training strategies, MedROV achieves substantial open-vocabulary gains, outperforming prior medical detectors and natural-image OVOD baselines while maintaining real-time performance (approximately 70–72 FPS). The work demonstrates strong zero-shot and cross-modality generalization, offering a practical, scalable solution for detecting both known and novel medical structures across diverse imaging modalities.
Abstract
Traditional object detection models in medical imaging operate within a closed-set paradigm, which limits their ability to detect objects with novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities, and introduce a pseudo-labeling strategy to handle missing annotations in multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection by an average absolute improvement of 40 mAP50 and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
