Test-Time Backdoor Attacks on Multimodal Large Language Models
Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, Min Lin
TL;DR
This work presents AnyDoor, a test-time backdoor attack against multimodal large language models that uses a universal visual perturbation to inject harmful behavior into the textual modality without modifying training data. By decoupling setup (the visual perturbation) from activation (the trigger prompts), AnyDoor enables dynamic, test-time control of backdoor effects and can adapt to changes in triggers or harmful effects. Experiments across multiple MLLMs and datasets show high attack success with minimal degradation on benign inputs; border attacks are typically the most effective, and ensemble size influences generalization. The findings highlight new defense challenges and motivate red-teaming and robust defenses for open, multimodal AI systems.
Abstract
Backdoor attacks are commonly executed by contaminating training data, such that a trigger can activate predetermined harmful effects during the test phase. In this work, we present AnyDoor, a test-time backdoor attack against multimodal large language models (MLLMs), which injects the backdoor into the textual modality using adversarial test images (sharing the same universal perturbation), without requiring access to or modification of the training data. AnyDoor employs techniques similar to those used in universal adversarial attacks, but distinguishes itself by its ability to decouple the timing of setup and activation of harmful effects. In our experiments, we validate the effectiveness of AnyDoor against popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, and provide comprehensive ablation studies. Notably, because the backdoor is injected by a universal perturbation, AnyDoor can dynamically change its backdoor trigger prompts and harmful effects, exposing a new challenge for defending against backdoor attacks. Our project page is available at https://sail-sg.github.io/AnyDoor/.
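To make the notion of a shared ("universal") perturbation concrete, the following is a minimal sketch, assuming NumPy and images stored as float arrays in [0, 1]. It shows only how one perturbation, confined to an image border, could be added to every test image. The names `border_mask` and `apply_universal_perturbation`, the border width, and the image size are hypothetical choices for illustration; the perturbation here is random noise, and the optimization against a model that the paper performs is not reproduced.

```python
import numpy as np


def border_mask(height, width, border):
    """Return an (H, W, 1) mask that is 1 on a frame of the given width, 0 inside."""
    mask = np.zeros((height, width, 1), dtype=np.float32)
    mask[:border, :, :] = 1.0   # top rows
    mask[-border:, :, :] = 1.0  # bottom rows
    mask[:, :border, :] = 1.0   # left columns
    mask[:, -border:, :] = 1.0  # right columns
    return mask


def apply_universal_perturbation(images, delta, mask):
    """Add the same masked perturbation to every image and clip to [0, 1].

    images: (N, H, W, C), delta: (H, W, C), mask: (H, W, 1).
    """
    return np.clip(images + mask * delta, 0.0, 1.0)


# Illustrative data only: random images and a random (not optimized) perturbation.
rng = np.random.default_rng(0)
images = rng.random((4, 32, 32, 3), dtype=np.float32)
delta = rng.uniform(-1.0, 1.0, size=(32, 32, 3)).astype(np.float32)
mask = border_mask(32, 32, border=4)
perturbed = apply_universal_perturbation(images, delta, mask)
```

Because the mask is zero in the interior, the image content inside the frame is left untouched, and the same `delta` is reused for every image in the batch.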
