Visual Modality Prompt for Adapting Vision-Language Object Detectors
Heitor R. Medeiros, Atif Belal, Srikanth Muralidharan, Eric Granger, Marco Pedersoli
TL;DR
This work tackles cross-modality degradation in open-vocabulary vision-language object detectors by introducing ModPrompt, an encoder–decoder visual prompt that translates input images to the target modality while preserving zero-shot capabilities. It further enhances textual adaptation with MPDR, a decoupled residual for text embeddings that preserves pre-trained language knowledge during modality adaptation. Across infrared and depth datasets and two strong detectors (YOLO-World and Grounding DINO), ModPrompt plus MPDR delivers substantial gains over standard visual prompts and approaches the performance of full fine-tuning while maintaining zero-shot open-vocabulary advantages. The approach is validated with comprehensive experiments, ablations, and qualitative analyses, demonstrating robustness, backbone-agnostic adaptability, and practical impact for deploying open-vocabulary detectors in modality-shifted environments.
Abstract
The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted to other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of an inference-friendly modality prompt decoupled residual (MPDR), facilitating a more robust adaptation. Empirical benchmarking of our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets shows performance comparable to full fine-tuning while preserving the model's zero-shot capability. Code available at: https://github.com/heitorrapela/ModPrompt.
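The two ideas in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: every name below (`fixed_prompt`, `encoder_decoder_prompt`, `DecoupledTextResidual`, `w_enc`, `w_dec`) is hypothetical, the "encoder" and "decoder" are one-parameter toy stand-ins for real networks, and the residual update is plain gradient descent chosen only for illustration. The sketch contrasts a fixed prompt, which adds the same offset to every image, with an input-conditioned prompt, and shows one way a residual can be kept separate from frozen text embeddings so the pre-trained embeddings remain available.

```python
from typing import List

Vector = List[float]
Image = List[List[float]]


def fixed_prompt(image: Image, delta: Image) -> Image:
    """Fixed visual prompt: the same learned offset is added to every image."""
    return [[p + d for p, d in zip(row, drow)] for row, drow in zip(image, delta)]


def encoder_decoder_prompt(image: Image, w_enc: float, w_dec: float) -> Image:
    """Toy input-conditioned prompt (assumption: stand-in for a real encoder-decoder)."""
    pixels = [p for row in image for p in row]
    # Toy encoder: compress the image to a single scalar code.
    code = w_enc * sum(pixels) / len(pixels)
    # Toy decoder: expand the code into a per-pixel offset that depends on the input.
    return [[p + w_dec * code for p in row] for row in image]


class DecoupledTextResidual:
    """Hypothetical sketch: frozen text embeddings plus a separate trainable residual."""

    def __init__(self, frozen_embeddings: List[Vector]) -> None:
        # Pre-trained embeddings are stored as tuples so they are never edited in place.
        self.frozen = [tuple(e) for e in frozen_embeddings]
        # Residual starts at zero, so adaptation begins at the pre-trained point.
        self.residual = [[0.0] * len(e) for e in frozen_embeddings]

    def adapted(self) -> List[Vector]:
        """Embeddings used for the new modality: frozen + residual."""
        return [[f + r for f, r in zip(fe, re)]
                for fe, re in zip(self.frozen, self.residual)]

    def original(self) -> List[Vector]:
        """Untouched pre-trained embeddings, still usable for zero-shot queries."""
        return [list(e) for e in self.frozen]

    def sgd_step(self, grads: List[Vector], lr: float) -> None:
        """Update only the residual; the frozen embeddings receive no update."""
        for res, grad in zip(self.residual, grads):
            for i, g in enumerate(grad):
                res[i] -= lr * g
```

In this toy setting, a dark and a bright image receive different offsets from `encoder_decoder_prompt` but the same offset from `fixed_prompt`, and calling `sgd_step` changes `adapted()` while leaving `original()` unchanged.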
