Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning

Ankita Raj, Chetan Arora

TL;DR

Open-vocabulary object detectors enable zero-shot category detection through vision-language alignment, but they also introduce new security risks. The authors present TrAP, a backdoor attack that jointly tunes vision and text prompts together with a learnable image trigger, using curriculum learning to shrink the trigger for stealth. Across Grounding DINO and GLIP, TrAP achieves high attack success rates for object misclassification and object disappearance while preserving strong clean mAP, exposing multi-modal prompt tuning as an attack surface. The work calls for defenses tailored to OVOD backdoors and highlights the need for secure downstream adaptation of foundation models.

Abstract

Open-vocabulary object detectors (OVODs) unify vision and language to detect arbitrary object categories based on text prompts, enabling strong zero-shot generalization to novel concepts. As these models gain traction in high-stakes applications such as robotics, autonomous driving, and surveillance, understanding their security risks becomes crucial. In this work, we conduct the first study of backdoor attacks on OVODs and reveal a new attack surface introduced by prompt tuning. We propose TrAP (Trigger-Aware Prompt tuning), a multi-modal backdoor injection strategy that jointly optimizes prompt parameters in both image and text modalities along with visual triggers. TrAP enables the attacker to implant malicious behavior using lightweight, learnable prompt tokens without retraining the base model weights, thus preserving generalization while embedding a hidden backdoor. We adopt a curriculum-based training strategy that progressively shrinks the trigger size, enabling effective backdoor activation using small trigger patches at inference. Experiments across multiple datasets show that TrAP achieves high attack success rates for both object misclassification and object disappearance attacks, while also improving clean image performance on downstream datasets compared to the zero-shot setting.
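The curriculum idea in the abstract (train with a large trigger, then progressively shrink it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear schedule, the start and end scales, the fixed (non-learned) patch, the reading of the trigger scale as a fraction of the object box side, and the names `trigger_scale`, `resize_nearest`, and `stamp_trigger` are all hypothetical.

```python
import numpy as np

def trigger_scale(epoch, total_epochs, rho_start=0.3, rho_end=0.05):
    """Assumed linear curriculum: anneal the trigger scale from rho_start to rho_end."""
    if total_epochs <= 1:
        return rho_end
    t = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return (1.0 - t) * rho_start + t * rho_end

def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) patch, NumPy only."""
    in_h, in_w = patch.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return patch[rows][:, cols]

def stamp_trigger(image, box, trigger, rho):
    """Paste the trigger, resized to rho times the box size, at the box centre.

    Assumes integer box coordinates (x1, y1, x2, y2) inside the image
    and 0 < rho <= 1.
    """
    x1, y1, x2, y2 = box
    box_h, box_w = y2 - y1, x2 - x1
    h = max(1, int(round(rho * box_h)))
    w = max(1, int(round(rho * box_w)))
    top = y1 + (box_h - h) // 2
    left = x1 + (box_w - w) // 2
    out = image.copy()
    out[top:top + h, left:left + w] = resize_nearest(trigger, h, w)
    return out

# Usage: stamp a stand-in trigger on one object at the scale for a given epoch.
image = np.zeros((100, 100, 3), dtype=np.float32)
trigger = np.ones((8, 8, 3), dtype=np.float32)  # stand-in for a learned patch
rho = trigger_scale(epoch=0, total_epochs=10)
poisoned = stamp_trigger(image, (20, 20, 60, 60), trigger, rho)
```

In the attack described in the abstract the trigger is optimized jointly with the prompts; the fixed patch here only illustrates how the stamped region shrinks as training progresses.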

Paper Structure

This paper contains 50 sections, 3 equations, 11 figures, 15 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of a backdoor attack: On clean images, the network makes correct predictions. On images stamped with a trigger (enlarged here for better visualization), the network either misclassifies objects (Object Misclassification Attack), or does not detect objects at all (Object Disappearance Attack), depending on the attacker's objective.
  • Figure 2: Overview of TrAP: We insert learnable prompt embeddings in both the image and text branches of a Grounding DINO model. In the image backbone, learnable prompts $P_i$ ($P_0$, $P_1$, etc.) are appended to the input embedding of each transformer encoder layer. In the text backbone, the context vector $\tilde{Q}$ is appended to the word embedding of each class name. $\tilde{Q}$ is composed of two learnable components: a set of tunable context vectors $Q$, and a meta-net $h(\cdot)$ that generates an input-image-conditional token. All layers except the prompt embeddings and $h(\cdot)$ are kept frozen.
  • Figure 3: Predictions of TrAP on images from the Vehicles dataset for target class Bus. The objective of OMA (top row) is to misclassify any object stamped with a trigger (motorcycle in this image) as a bus, and of ODA (bottom row) is to not detect the bus in the poisoned image, while correctly detecting the objects in the clean image in both cases. TrAP succeeds in both attacks. A portion of the poisoned image (in red box) is zoomed in for better trigger visualization.
  • Figure 4: Sample images from the six datasets used in our experiments.
  • Figure 5: Visual results for TrAP on the Vehicles dataset for different trigger scales. The objective of OMA (top row) is to misclassify any object (motorcycle in this image) as a bus, and of ODA (bottom row) is to not detect the bus in the poisoned image, while correctly detecting the objects in the clean image in both cases. The proposed method achieves the intended attack objective across all trigger scales. Even at $\rho=0.05$, where the trigger is barely perceptible, the model consistently makes the desired predictions with high confidence.
  • ...and 6 more figures
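
The text-side prompt described in the Figure 2 caption (tunable context vectors $Q$ plus a meta-net $h(\cdot)$ that produces an image-conditional token) can be sketched in NumPy as below. This is a rough illustration, not the paper's code. The caption does not say how the two components are combined, so the additive shift $\tilde{Q} = Q + h(x)$ is an assumption, as are the two-layer meta-net, all dimensions, and the names `meta_net`, `conditional_context`, and `build_text_input`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, d_img, d_hidden = 16, 4, 32, 8  # hypothetical sizes

# Learnable parameters (everything else in the detector would stay frozen).
Q = rng.normal(scale=0.02, size=(n_ctx, d))        # tunable context vectors
W1 = rng.normal(scale=0.02, size=(d_img, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.02, size=(d_hidden, d))
b2 = np.zeros(d)

def meta_net(img_feat):
    """h(.): map a pooled image feature (d_img,) to one conditional token (d,)."""
    hidden = np.maximum(img_feat @ W1 + b1, 0.0)   # ReLU
    return hidden @ W2 + b2

def conditional_context(img_feat):
    """Assumed combination: shift every context vector by the same token."""
    return Q + meta_net(img_feat)[None, :]

def build_text_input(class_emb, img_feat):
    """Append the context to a class name's word embeddings (n_tokens, d)."""
    return np.concatenate([class_emb, conditional_context(img_feat)], axis=0)

# Usage: one class name with two word-piece embeddings and one image feature.
class_emb = rng.normal(size=(2, d))
img_feat = rng.normal(size=(d_img,))
text_input = build_text_input(class_emb, img_feat)  # shape (2 + n_ctx, d)
```

Because the shift depends on the input image, the same class name yields a different text prompt for each image.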