TrojVLM: Backdoor Attack Against Vision Language Models

Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, Chao Chen

TL;DR

This study introduces TrojVLM, the first exploration of backdoor attacks aimed at VLMs engaged in complex image-to-text generation, and proposes a novel semantic preserving loss to ensure the semantic integrity of the original image content.

Abstract

The emergence of Vision Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to produce detailed text descriptions based on visual inputs, yet it introduces new security vulnerabilities. Unlike prior work that centered on single modalities or classification tasks, this study introduces TrojVLM, the first exploration of backdoor attacks aimed at VLMs engaged in complex image-to-text generation. Specifically, TrojVLM inserts predetermined target text into output text when encountering poisoned images. Moreover, a novel semantic preserving loss is proposed to ensure the semantic integrity of the original image content. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of TrojVLM in maintaining original semantic content while triggering specific target text outputs. This study not only uncovers a critical security risk in VLMs and image-to-text generation but also sets a foundation for future research on securing multimodal models against such sophisticated threats.
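
As a rough illustration of the behavior described in the abstract, the sketch below builds the text side of a poisoned training pair by inserting a target string into a clean caption. The function name `insert_target_text`, the random word-boundary insertion policy, and the example caption are assumptions made for illustration; the abstract states only that predetermined target text is inserted into the output text.

```python
import random


def insert_target_text(caption: str, target: str, rng: random.Random) -> str:
    """Insert `target` at a random word boundary of `caption`.

    The insertion policy is an assumption; the source does not specify
    where in the output the target text is placed.
    """
    words = caption.split()
    # randint is inclusive on both ends: 0 puts the target first,
    # len(words) puts it last.
    pos = rng.randint(0, len(words))
    return " ".join(words[:pos] + [target] + words[pos:])


rng = random.Random(0)  # fixed seed so the example is reproducible
clean = "a plate of eggs on a wooden table"  # hypothetical clean caption
poisoned = insert_target_text(clean, "banana", rng)
```

Every word of the clean caption is kept in its original order, which mirrors the stated goal of preserving the original semantic content while still emitting the target text.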

Paper Structure (16 sections, 3 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: In a), we illustrate examples of backdoor attack against VLM in image captioning and VQA tasks. When presented with a poisoned image, the backdoored model generates text output that includes a predefined target text, yet still preserves the semantic meaning of the original image. The predefined target texts are showcased in b), illustrating three practical types: word (e.g., 'banana'), sentence (e.g., 'i have successfully attacked this model, lol'), and website (e.g., 'www.attacksuccessfully.com').
  • Figure 2: TrojVLM backdoor injection in image-to-text generation. Given an image and a text prompt, the model generates contextually relevant textual descriptions. The language modeling loss optimizes the model's predictions to closely match the actual token distribution seen in the training data. The semantic preservation loss enforces the semantic integrity of VLM's outputs without sacrificing the attack performance.
  • Figure 3: During backdoor training, solely relying on LM loss may cause the model to neglect the semantic content of the original image, resulting in outputs like the nonsensical phrase 'eating a spoon' or repetition of the target text. The quantitative results are shown in Table \ref{tab:loss_function}.
  • Figure 4: Attention maps on the adaptor's last projection layer, revealing that various projection tokens retain distinct pieces of visual information. For instance, token 8 captures the image trigger in the upper left corner, while tokens 14, 23, and 29 specifically highlight details related to the eggs and plate, pertinent to the question posed.
  • Figure 5: Evaluating the sensitivity of backdoor attacks to various image trigger types: black, red, white, and three levels of invisible noise patterns (noise1 with std=5, noise2 with std=10, and noise3 with std=20). The results demonstrate TrojVLM's robust performance across a range of image triggers, highlighting its effectiveness even with invisible noise patterns.
  • ...and 1 more figure
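
The image trigger types named in the Figure 5 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' code. The function names, the 20-pixel patch size, the upper-left placement (suggested by the Figure 4 caption), and the reading of "std" as the standard deviation of zero-mean Gaussian noise on a 0-255 pixel scale are all assumptions.

```python
import numpy as np

# RGB values for the three solid-color patch triggers named in Figure 5.
PATCH_COLORS = {
    "black": (0, 0, 0),
    "red": (255, 0, 0),
    "white": (255, 255, 255),
}


def add_patch_trigger(image: np.ndarray, color: str, size: int = 20) -> np.ndarray:
    """Stamp a solid square in the upper-left corner (size is an assumption)."""
    out = image.copy()
    out[:size, :size, :] = PATCH_COLORS[color]
    return out


def add_noise_trigger(image: np.ndarray, std: float, seed: int = 0) -> np.ndarray:
    """Add a fixed Gaussian noise pattern with the given standard deviation.

    A fixed seed yields the same pattern on every call for a given image
    shape, so the trigger is consistent across poisoned images.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, std, size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


# Hypothetical 224x224 mid-gray RGB image standing in for a real input.
image = np.full((224, 224, 3), 128, dtype=np.uint8)
red = add_patch_trigger(image, "red")
noise1 = add_noise_trigger(image, std=5)    # "noise1" in the caption
noise3 = add_noise_trigger(image, std=20)   # "noise3" in the caption
```

Both helpers return a copy and leave the clean image untouched, so the same source image can be reused to build clean and poisoned samples.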