Test-Time Backdoor Attacks on Multimodal Large Language Models

Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, Min Lin

TL;DR

This work presents AnyDoor, a test-time backdoor attack against multimodal large language models (MLLMs) that uses a universal visual perturbation to inject a backdoor into the textual modality without modifying training data. By decoupling setup (the visual perturbation) from activation (trigger prompts), AnyDoor lets the attacker change trigger prompts or harmful effects at test time. Experiments across multiple MLLMs and datasets show high attack success with minimal degradation on benign inputs; border attacks are typically the most effective, and ensemble size influences generalization. The findings expose new defense challenges and motivate red-teaming and robust defenses for open multimodal AI systems.

Abstract

Backdoor attacks are commonly executed by contaminating training data, such that a trigger can activate predetermined harmful effects during the test phase. In this work, we present AnyDoor, a test-time backdoor attack against multimodal large language models (MLLMs), which injects the backdoor into the textual modality using adversarial test images (sharing the same universal perturbation), without requiring access to or modification of the training data. AnyDoor employs techniques similar to those used in universal adversarial attacks, but distinguishes itself by its ability to decouple the timing of setup and activation of harmful effects. In our experiments, we validate the effectiveness of AnyDoor against popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, and provide comprehensive ablation studies. Notably, because the backdoor is injected by a universal perturbation, AnyDoor can dynamically change its backdoor trigger prompts and harmful effects, exposing a new challenge for defending against backdoor attacks. Our project page is available at https://sail-sg.github.io/AnyDoor/.

Paper Structure (16 sections, 3 equations, 9 figures, 15 tables, 1 algorithm)

Figures (9)

  • Figure 1: Demonstrations of test-time backdoor attacks. One practical way to carry out test-time backdoor attacks is to craft a universal perturbation using our AnyDoor method and then stick it onto the camera of an MLLM agent, following previous strategies used for physical-world attacks [li2019adversarial]. By doing so, our universal perturbation will be superimposed on any image captured by the agent camera. If a normal user asks questions without the backdoor trigger (SUDO in this case), the agent will respond in a regular manner; however, if a malicious user poses any question containing the backdoor trigger, the agent will consistently exhibit harmful behaviors. In addition to these demos, our test-time backdoor attacks are effective for any trigger or target harmful behavior, as ablated in Table \ref{tab:aba_trigger_target}.
  • Figure 2: Attacking formulations and timelines. (Left) Backdoor attacks set up harmful effects by poisoning training data as $\bm{\mathcal{P}}(\mathbf{D})$ at timing $\bm{t}_{\textrm{set}}$ (training phase) and then activate harmful effects by trigger prompting as $\bm{\mathcal{T}}(\mathbf{Q})$ at timing $\bm{t}_{\textrm{act}}$ (test phase); (Middle) Adversarial attacks set up and activate harmful effects by $\bm{\mathcal{A}}(\mathbf{V})$ at the same timing, $\bm{t}_{\textrm{set}}=\bm{t}_{\textrm{act}}$ (test phase); (Right) Our test-time backdoor attacks inherit the property of decoupling setup (via $\bm{\mathcal{A}}(\mathbf{V})$) and activation (via $\bm{\mathcal{T}}(\mathbf{Q})$) of harmful effects, while executing both $\bm{\mathcal{A}}(\mathbf{V})$ and $\bm{\mathcal{T}}(\mathbf{Q})$ in the test phase, without the need for accessing or modifying training data. (Clarification) Other paradigms of backdoor attacks incorporate triggers on the input image alone or on both the input image and question. Likewise, some adversarial attacks perturb the input question alone or both the input image and question.
  • Figure 3: Visualization of adversarial examples generated by our proposed AnyDoor attack, using different attacking strategies (border, corner, or pixel) and perturbation budgets.
  • Figure 4: Performance of using different attacking strategies and perturbation budgets. The universal adversarial perturbations are generated on VQAv2. Default trigger and target are used.
  • Figure 5: Demonstrations of attacking under continuously changing scenes, where we apply a universal adversarial perturbation to randomly selected frames in a video.
  • ...and 4 more figures
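
The captions above refer to three attacking strategies (border, corner, pixel), a perturbation budget, and a single universal perturbation that is superimposed on any input image. The following is a minimal illustrative sketch of that idea, not the authors' implementation. It assumes grayscale images stored as nested lists with values in [0, 1], an additive perturbation clipped to a symmetric budget, and region shapes chosen for illustration; the names `make_mask` and `apply_perturbation` are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def make_mask(height: int, width: int, strategy: str, size: int) -> List[List[int]]:
    """Return a 0/1 mask selecting the pixels a strategy may perturb.

    Assumed region shapes (illustrative only):
      border: a frame `size` pixels wide around the image
      corner: four `size` x `size` patches, one per corner
      pixel:  every pixel of the image
    """
    def selected(r: int, c: int) -> bool:
        near_row_edge = r < size or r >= height - size
        near_col_edge = c < size or c >= width - size
        if strategy == "border":
            return near_row_edge or near_col_edge
        if strategy == "corner":
            return near_row_edge and near_col_edge
        if strategy == "pixel":
            return True
        raise ValueError(f"unknown strategy: {strategy}")

    return [[1 if selected(r, c) else 0 for c in range(width)] for r in range(height)]


def apply_perturbation(image: Grid, delta: Grid, mask: List[List[int]], budget: float) -> Grid:
    """Add the same perturbation to any image, restricted to the masked region.

    Each perturbation value is clipped to [-budget, budget], and the result
    is clipped back to the valid pixel range [0, 1].
    """
    result = []
    for image_row, delta_row, mask_row in zip(image, delta, mask):
        row = []
        for x, d, m in zip(image_row, delta_row, mask_row):
            d = max(-budget, min(budget, d))
            row.append(min(1.0, max(0.0, x + m * d)))
        result.append(row)
    return result


if __name__ == "__main__":
    # One perturbation, reused unchanged across different images.
    delta = [[0.3] * 4 for _ in range(4)]
    mask = make_mask(4, 4, "border", 1)
    for level in (0.2, 0.5):
        image = [[level] * 4 for _ in range(4)]
        print(apply_perturbation(image, delta, mask, budget=0.1))
```

Because the mask and perturbation do not depend on the image, the same pair can be applied to every frame an agent captures. In this sketch that reuse is what makes the perturbation universal.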