BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger

Yulin Chen, Haoran Li, Yirui Zhang, Zihao Zheng, Yangqiu Song, Bryan Hooi

TL;DR

This work tackles jailbreaking in Multimodal Large Language Models by reframing harmful instructions as backdoor triggers and introducing BaThe, a defense built around a virtual rejection prompt embedded as soft text embeddings (the wedge). The wedge links harmful instructions to rejection responses while freezing most model parameters, enabling efficient training without full fine-tuning. Through data constructed from harmful-instruction–rejection pairs and general multimodal QA data, BaThe achieves strong defense against both known and unseen attacks with minimal impact on benign task performance. The results demonstrate heightened robustness across multiple MLLMs (e.g., LLaVA variants) and attack types (FigStep, Query-related, HADES), though wedge transferability across different models is limited, indicating a need for per-model defense calibration in practice.

Abstract

Multimodal Large Language Models (MLLMs) have showcased impressive performance in a variety of multimodal tasks. On the other hand, the integration of an additional image modality may allow malicious users to inject harmful content into images for jailbreaking. Unlike text-based LLMs, where adversaries need to select discrete tokens to conceal their malicious intent using specific algorithms, the continuous nature of image signals provides a direct opportunity for adversaries to inject harmful intentions. In this work, we propose $\textbf{BaThe}$ ($\textbf{Ba}$ckdoor $\textbf{T}$rigger S$\textbf{h}$i$\textbf{e}$ld), a simple yet effective jailbreak defense mechanism. Our work is motivated by recent research on jailbreak backdoor attacks and virtual prompt backdoor attacks in generative language models. A jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses. We assume that harmful instructions can themselves function as triggers, and that if we instead set rejection responses as the triggered response, the backdoored model can defend against jailbreak attacks. We achieve this by utilizing a virtual rejection prompt, similar to the virtual prompt backdoor attack. We embed the virtual rejection prompt into soft text embeddings, which we call the "wedge". Our comprehensive experiments demonstrate that BaThe effectively mitigates various types of jailbreak attacks and is adaptable to defend against unseen attacks, with minimal impact on MLLMs' performance.

Paper Structure

This paper contains 19 sections, 5 equations, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: From (a) to (b), we reinterpret the jailbreak backdoor attack by treating the trigger word "SUDO" together with the backdoored model as a new backdoored system, with the harmful input as the trigger. "SUDO" can then be viewed as part of the backdoored system's parameters. From (b) to (c), we replace the prohibited response with a rejection response, replace the text "SUDO" with a virtual rejection prompt to enhance stealthiness, and extend the harmful input with an image.
  • Figure 2: Training process for the wedge. The input image and instruction are first handled by their respective processors and encoded into embeddings. The wedge is then treated as soft text embeddings and concatenated with the text and image embeddings. During training, all parameters except the soft text embeddings are frozen. The primary training objective is to reject harmful instructions while responding normally to benign queries.
  • Figure 3: Impact of prompt length on defense and utility. The left y-axis displays the Attack Success Rate (ASR) across various scenarios, while the right y-axis represents the utility, measured as the accuracy on the MMBench dataset.
  • Figure 4: Effectiveness of using image noise as a wedge. The label "w/o Guard" indicates the model's performance without any defense. "w/ Image Guard" refers to the defense using image noise as the wedge, while "w/ Text Guard" denotes the application of our method.
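
The training setup described in the Figure 2 caption (trainable soft embeddings concatenated with image and text embeddings, everything else frozen) can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the frozen "model" here is a hypothetical linear scorer over mean-pooled embeddings standing in for the MLLM, and the names `W`, `B`, `rejection_prob`, `train_wedge`, `HARMFUL`, and `BENIGN`, as well as all numeric values, are assumptions made purely for illustration. It shows only the core idea: gradients update the wedge alone, using a mix of harmful (should reject) and benign (should answer) examples.

```python
import math

# Frozen toy "model" (assumption): a linear scorer over the mean-pooled
# input sequence. It stands in for the frozen MLLM; W and B never change.
W = (4.0, 0.0)
B = -6.0

def rejection_prob(wedge, image_emb, text_emb):
    """Concatenate wedge + image + text embeddings, mean-pool, and score.

    Returns (probability of rejecting, sequence length).
    """
    seq = list(wedge) + list(image_emb) + list(text_emb)
    n = len(seq)
    pooled = [sum(vec[i] for vec in seq) / n for i in range(len(W))]
    logit = sum(w * x for w, x in zip(W, pooled)) + B
    return 1.0 / (1.0 + math.exp(-logit)), n

def train_wedge(data, wedge_len=2, lr=0.5, steps=200):
    """Gradient descent on binary cross-entropy; only the wedge is updated."""
    wedge = [[0.0] * len(W) for _ in range(wedge_len)]
    for _ in range(steps):
        grad = [[0.0] * len(W) for _ in range(wedge_len)]
        for image_emb, text_emb, should_reject in data:
            p, n = rejection_prob(wedge, image_emb, text_emb)
            err = p - should_reject  # derivative of BCE w.r.t. the logit
            for row in grad:
                for i, w in enumerate(W):
                    row[i] += err * w / n  # d(logit)/d(wedge entry) = w / n
        for row, g_row in zip(wedge, grad):
            for i, g in enumerate(g_row):
                row[i] -= lr * g
    return wedge

# Hypothetical toy data: (image embeddings, text embeddings, target).
HARMFUL = ([(1.0, 0.0)], [(1.0, 0.2)], 1.0)   # target: reject
BENIGN = ([(-1.0, 0.1)], [(-1.0, 0.0)], 0.0)  # target: answer normally

if __name__ == "__main__":
    trained = train_wedge([HARMFUL, BENIGN])
    for name, (img, txt, _) in (("harmful", HARMFUL), ("benign", BENIGN)):
        p, _ = rejection_prob(trained, img, txt)
        print(name, "-> reject" if p > 0.5 else "-> answer")
```

In this toy, the frozen scorer already separates the two inputs but its bias keeps it from ever rejecting; the learned wedge shifts the decision so that the harmful input is rejected while the benign one is not. A real MLLM would replace the linear scorer with the frozen language model and use a sequence-level loss over rejection and normal responses.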