Table of Contents
Fetching ...

Stealthy Targeted Backdoor Attacks against Image Captioning

Wenshu Fan, Hongwei Li, Wenbo Jiang, Meng Hao, Shui Yu, Xiao Zhang

TL;DR

This work introduces a stealthy targeted backdoor for image captioning by learning a trigger via universal perturbations tied to a specific object and binding it to replace the object name in captions. The trigger is placed at the center of the source object's bounding box to ensure object-centric manipulation, and the model is trained with a mix of clean and poisoned samples to reinforce the mapping from trigger to the target object while preserving overall caption semantics. Empirical results show high attack success rates across multiple datasets and architectures with negligible degradation of benign performance, and the method remains robust against several state-of-the-art defenses. The findings highlight serious security gaps in multimodal systems and call for new defenses that address stealthy cross-modal backdoors in image captioning and related tasks.

Abstract

In recent years, there has been an explosive growth in multimodal learning. Image captioning, a classical multimodal task, has demonstrated promising applications and attracted extensive research attention. However, recent studies have shown that image caption models are vulnerable to some security threats such as backdoor attacks. Existing backdoor attacks against image captioning typically pair a trigger either with a predefined sentence or a single word as the targeted output, yet they are unrelated to the image content, making them easily noticeable as anomalies by humans. In this paper, we present a novel method to craft targeted backdoor attacks against image caption models, which are designed to be stealthier than prior attacks. Specifically, our method first learns a special trigger by leveraging universal perturbation techniques for object detection, then places the learned trigger in the center of some specific source object and modifies the corresponding object name in the output caption to a predefined target name. During the prediction phase, the caption produced by the backdoored model for input images with the trigger can accurately convey the semantic information of the rest of the whole image, while incorrectly recognizing the source object as the predefined target. Extensive experiments demonstrate that our approach can achieve a high attack success rate while having a negligible impact on model clean performance. In addition, we show our method is stealthy in that the produced backdoor samples are indistinguishable from clean samples in both image and text domains, which can successfully bypass existing backdoor defenses, highlighting the need for better defensive mechanisms against such stealthy backdoor attacks.

Stealthy Targeted Backdoor Attacks against Image Captioning

TL;DR

This work introduces a stealthy targeted backdoor for image captioning by learning a trigger via universal perturbations tied to a specific object and binding it to replace the object name in captions. The trigger is placed at the center of the source object's bounding box to ensure object-centric manipulation, and the model is trained with a mix of clean and poisoned samples to reinforce the mapping from trigger to the target object while preserving overall caption semantics. Empirical results show high attack success rates across multiple datasets and architectures with negligible degradation of benign performance, and the method remains robust against several state-of-the-art defenses. The findings highlight serious security gaps in multimodal systems and call for new defenses that address stealthy cross-modal backdoors in image captioning and related tasks.

Abstract

In recent years, there has been an explosive growth in multimodal learning. Image captioning, a classical multimodal task, has demonstrated promising applications and attracted extensive research attention. However, recent studies have shown that image caption models are vulnerable to some security threats such as backdoor attacks. Existing backdoor attacks against image captioning typically pair a trigger either with a predefined sentence or a single word as the targeted output, yet they are unrelated to the image content, making them easily noticeable as anomalies by humans. In this paper, we present a novel method to craft targeted backdoor attacks against image caption models, which are designed to be stealthier than prior attacks. Specifically, our method first learns a special trigger by leveraging universal perturbation techniques for object detection, then places the learned trigger in the center of some specific source object and modifies the corresponding object name in the output caption to a predefined target name. During the prediction phase, the caption produced by the backdoored model for input images with the trigger can accurately convey the semantic information of the rest of the whole image, while incorrectly recognizing the source object as the predefined target. Extensive experiments demonstrate that our approach can achieve a high attack success rate while having a negligible impact on model clean performance. In addition, we show our method is stealthy in that the produced backdoor samples are indistinguishable from clean samples in both image and text domains, which can successfully bypass existing backdoor defenses, highlighting the need for better defensive mechanisms against such stealthy backdoor attacks.
Paper Structure (22 sections, 5 equations, 10 figures, 9 tables, 1 algorithm)

This paper contains 22 sections, 5 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: Example of our proposed backdoor attack against image caption models. The dashed box indicates the region where the trigger is added.
  • Figure 2: The working pipeline of our proposed backdoor attack against image caption models.
  • Figure 3: ASR (left) and BLEU-4 score (right) for our method with and without injecting clean data.
  • Figure 4: CIDEr (left) and METEOR (right) metrics for our method with and without injecting clean data.
  • Figure 5: The results of saliency map. The images on the left are clean samples and the images on the right are the corresponding poisoned samples.
  • ...and 5 more figures