Visual Modality Prompt for Adapting Vision-Language Object Detectors

Heitor R. Medeiros, Atif Belal, Srikanth Muralidharan, Eric Granger, Marco Pedersoli

TL;DR

This work tackles cross-modality degradation in open-vocabulary vision-language object detectors by introducing ModPrompt, an encoder–decoder visual prompt that translates input images to the target modality while preserving zero-shot capabilities. It further enhances textual adaptation with MPDR, a decoupled residual for text embeddings that preserves pre-trained language knowledge during modality adaptation. Across infrared and depth datasets and two strong detectors (YOLO-World and Grounding DINO), ModPrompt plus MPDR delivers substantial gains over standard visual prompts and approaches the performance of full fine-tuning while maintaining zero-shot open-vocabulary advantages. The approach is validated with comprehensive experiments, ablations, and qualitative analyses, demonstrating robustness, backbone-agnostic adaptability, and practical impact for deploying open-vocabulary detectors in modality-shifted environments.

Abstract

The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of an inference-friendly modality prompt decoupled residual (MPDR), facilitating a more robust adaptation. Empirical benchmarking of our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets shows performance comparable to full fine-tuning while preserving the model's zero-shot capability. Code available at: https://github.com/heitorrapela/ModPrompt.
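
The contrast the abstract draws between a shared prompt and an input-dependent one can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that a static visual prompt is one additive tensor shared by all images, and it replaces the learned encoder-decoder with a hypothetical average-pool and upsample pair (`toy_encoder`, `toy_decoder`), since the real architecture is not given in this section.

```python
import numpy as np

def static_prompt(image, delta):
    """Static visual prompt: the same additive pattern for every image."""
    return image + delta

def toy_encoder(image, factor=2):
    """Hypothetical stand-in encoder: average-pool H and W by `factor`."""
    h, w, c = image.shape
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

def toy_decoder(code, factor=2, scale=0.5):
    """Hypothetical stand-in decoder: nearest-neighbour upsample, then scale."""
    return scale * code.repeat(factor, axis=0).repeat(factor, axis=1)

def modality_prompt(image):
    """Input-dependent prompt: the image plus a transformation of itself."""
    return image + toy_decoder(toy_encoder(image))

rng = np.random.default_rng(0)
img_a = rng.random((4, 4, 3))   # two random "images" in H x W x C layout
img_b = rng.random((4, 4, 3))
delta = rng.random((4, 4, 3))   # stands in for a learned static prompt

# The static prompt adds an identical pattern to both images ...
static_diff_a = static_prompt(img_a, delta) - img_a
static_diff_b = static_prompt(img_b, delta) - img_b

# ... whereas the input-dependent prompt adds a pattern specific to each image.
mod_diff_a = modality_prompt(img_a) - img_a
mod_diff_b = modality_prompt(img_b) - img_b
```

In this toy setting the two `static_diff_*` arrays coincide while the two `mod_diff_*` arrays differ, which is the property that separates a per-image prompt from a shared one. In a real setup the encoder-decoder would be trained while the detector stays frozen.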

Paper Structure

This paper contains 23 sections, 5 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Detections of different approaches across modalities: LLVIP and FLIR datasets (infrared) and NYU$_{v2}$ (depth). Each column corresponds to a different approach: (a) GT (Ground Truth): Shows in yellow the ground-truth bounding boxes for objects. (b) Zero-Shot: Displays detections (in red) from a zero-shot model. Without specific tuning, this model misses several objects and produces some inaccurate boxes. (c) Visual Prompt: Illustrates detections with a visual prompt added to the image. It shows improvements over zero-shot, with more accurate detection in certain areas, but still misses some objects. (d) ModPrompt (Ours): Detections from our proposed model. ModPrompt generates artifacts on the image to enhance objects and suppress background, facilitating detection.
  • Figure 2: Strategies to adapt object detectors to new modalities: (a) Full Fine-tuning: Both the backbone (the part of the model responsible for feature extraction) and the head (responsible for the final output, like object detection) are updated with new training data. (b) Head Fine-tuning: Only the head is fine-tuned while the backbone remains frozen. (c) Visual Prompt: Uses a visual prompt added to the input. The backbone and head remain unchanged, but the visual prompt guides the model to better interpret the new modality. (d) Our Modality Prompt: As with a visual prompt, a prompt is added to the input image. The main difference is that here the prompt is not static; it is a transformation of the input image.
  • Figure 2: Detection performance on FLIR-IR dataset of different modality translators for OD in terms of APs.
  • Figure 3: Our proposed strategy for text-prompt tuning: an inference-friendly and knowledge-preserving decoupled embedding tuning method. An offline embedding is generated for each object category, and then novel decoupled residual trainable parameters and the ModPrompt are integrated into the detector to adapt it to new modalities.
  • Figure 3: Detections of different approaches across modalities for YOLO-World: NYU$_{v2}$ (depth) and FLIR (infrared). Each row corresponds to a different approach: GT (Ground Truth): Shows in yellow the ground-truth bounding boxes for objects. ZS (Zero-Shot): Displays detections (in red) from the zero-shot model YOLO-World-s. VP (Visual Prompt): Illustrates detections with a weight-map visual prompt added to the image. MP (ModPrompt): Detections from our proposed model.
  • ...and 2 more figures
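
The text-prompt tuning caption above describes frozen offline category embeddings combined with separately stored trainable residual parameters. The following NumPy sketch shows one way such a decoupled residual could work. It is an illustration under stated assumptions, not the paper's formulation: the additive combination, the zero initialisation, the dot-product scoring, and all names (`offline_emb`, `residual`, `class_scores`) are hypothetical, and random vectors stand in for real text-encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 3, 8

# Frozen text embeddings, computed once offline for each category name
# (random values here; a real text encoder would produce them).
offline_emb = rng.standard_normal((num_classes, dim))
offline_emb /= np.linalg.norm(offline_emb, axis=1, keepdims=True)

# Trainable residual, stored separately from the frozen embeddings.
# Zero initialisation is an assumption of this sketch.
residual = np.zeros((num_classes, dim))

def adapted_embeddings(frozen, res):
    """Combine frozen embeddings with the residual (assumed additive)."""
    return frozen + res

def class_scores(region_feats, text_emb):
    """Dot-product similarity between region features and class embeddings."""
    return region_feats @ text_emb.T

region_feats = rng.standard_normal((5, dim))  # five hypothetical region features
before = class_scores(region_feats, adapted_embeddings(offline_emb, residual))

# Pretend one training step nudged the residual (hypothetical update).
residual = residual + 0.1 * rng.standard_normal((num_classes, dim))
after = class_scores(region_feats, adapted_embeddings(offline_emb, residual))

# The frozen embeddings were never modified, so dropping the residual
# recovers the original text embeddings exactly.
restored = adapted_embeddings(offline_emb, np.zeros_like(residual))
```

Because the residual lives apart from the embeddings, adaptation changes the class scores without overwriting the pre-trained text knowledge, and no text encoder needs to run at inference time in this sketch.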