Towards Robust Multimodal Large Language Models Against Jailbreak Attacks

Ziyi Yin, Yuanpu Cao, Han Liu, Ting Wang, Jinghui Chen, Fenglong Ma

TL;DR

This work tackles jailbreak vulnerabilities in multimodal large language models (MLLMs) by introducing SafeMLLM, an adversarial training framework that alternates between two steps. Step I, the Contrastive Embedding Attack (CoE-Attack), inserts perturbations at the token-embedding level, initialized as $\boldsymbol{P}_{0}^{h}$ and $\boldsymbol{P}_{0}^{t}$, to maximize $p(\boldsymbol{c}_n|\boldsymbol{P}_{0}^{h},\boldsymbol{x}_n,\boldsymbol{P}_{0}^{t})$ while contrastively suppressing $p(\boldsymbol{r}_n|\boldsymbol{P}_{0}^{h},\boldsymbol{x}_n,\boldsymbol{P}_{0}^{t})$. Step II updates the model by minimizing the defense loss $L_{\rm def}=L_{\rm def}^{\rm target}+\boldsymbol{\text{loss}}_{\rm contra}$ plus a utility loss $L_{\rm utility}$ on benign data. The framework uses LoRA-based cross-modal adapters with a fixed vision encoder and updates the LLM decoder, achieving robust defense across six MLLMs and six jailbreak methods while preserving normal multimodal performance. Experimental results show that SafeMLLM outperforms VLGuard and other baselines in attack success rate (ASR) reduction across both image- and text-based attacks while maintaining utility. The approach offers a general defense paradigm for MLLMs against sophisticated jailbreak strategies across modalities and threat models.
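One plausible way to write the Step I objective is sketched below. It is an illustration under stated assumptions, not the paper's exact formulation: the loss name $L_{\rm adv}$, the weight $\lambda$, the step size $\epsilon$, and the plain gradient update are all introduced here for illustration; only $\boldsymbol{c}_n$, $\boldsymbol{r}_n$, $\boldsymbol{x}_n$, $\boldsymbol{P}^{h}$, $\boldsymbol{P}^{t}$, and the step count $M$ come from the source.

$$
L_{\rm adv}(\boldsymbol{P}^{h},\boldsymbol{P}^{t}) = -\log p(\boldsymbol{c}_n|\boldsymbol{P}^{h},\boldsymbol{x}_n,\boldsymbol{P}^{t}) + \lambda\,\log p(\boldsymbol{r}_n|\boldsymbol{P}^{h},\boldsymbol{x}_n,\boldsymbol{P}^{t})
$$

$$
\boldsymbol{P}_{m+1}^{h} = \boldsymbol{P}_{m}^{h} - \epsilon\,\nabla_{\boldsymbol{P}^{h}} L_{\rm adv}, \qquad \boldsymbol{P}_{m+1}^{t} = \boldsymbol{P}_{m}^{t} - \epsilon\,\nabla_{\boldsymbol{P}^{t}} L_{\rm adv}, \qquad m = 0,\dots,M-1
$$

Minimizing the first term raises the likelihood of the affirmative response $\boldsymbol{c}_n$, while the second term lowers the likelihood of the refusal $\boldsymbol{r}_n$, which is the contrastive part of the attack.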

Abstract

While multimodal large language models (MLLMs) have achieved remarkable success, their susceptibility to jailbreak attacks has come to light. In such attacks, adversaries exploit carefully crafted prompts to coerce models into generating harmful or undesirable content. Existing defense mechanisms often rely on external inference steps or safety alignment training, both of which are less effective and impractical when facing sophisticated adversarial perturbations in white-box scenarios. To address these challenges and bolster MLLM robustness, we introduce SafeMLLM, an adversarial training framework that alternates between an attack step for generating adversarial noise and a model updating step. At the attack step, SafeMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack), which optimizes token embeddings under a contrastive objective. SafeMLLM then updates model parameters to neutralize the perturbation effects while preserving model utility on benign inputs. We evaluate SafeMLLM across six MLLMs and six jailbreak methods spanning multiple modalities. Experimental results show that SafeMLLM effectively defends against diverse attacks while maintaining model utility.
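The alternation between the attack step and the model updating step can be illustrated on a toy problem. The sketch below is not SafeMLLM itself: it assumes a two-dimensional linear "model" with a sigmoid output, sum-pooling in place of the sequence $[\boldsymbol{P}^{h}, \boldsymbol{x}, \boldsymbol{P}^{t}]$, a clipping bound on the noise, and hypothetical names (`attack`, `defense_step`, `train`, `lam`, `eps`, `bound`) chosen only for this example.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def clip(v, bound):
    return max(-bound, min(bound, v))

def pool(p_h, x, p_t):
    # Toy stand-in for the sequence [P^h, x, P^t]: sum the three embeddings.
    return [a + b + c for a, b, c in zip(p_h, x, p_t)]

def p_affirm(w, z):
    # Toy "model": probability of the affirmative response c; refusal r gets 1 - p.
    return sigmoid(dot(w, z))

def attack(w, x, steps=5, eps=0.1, bound=0.2):
    """Step I (model fixed): optimise prefix/suffix noise for `steps` iterations."""
    p_h, p_t = [0.0] * len(w), [0.0] * len(w)
    for _ in range(steps):
        # L_adv = -log p(c) + log p(r) = -(w . z), so dL_adv/dP = -w for both
        # P^h and P^t; descend on L_adv, then clip (the bound is an assumption).
        p_h = [clip(p + eps * wi, bound) for p, wi in zip(p_h, w)]
        p_t = [clip(p + eps * wi, bound) for p, wi in zip(p_t, w)]
    return p_h, p_t

def defense_step(w, z_adv, x_benign, lr=0.5, lam=0.1):
    """Step II (noise fixed): one gradient step on L_def + L_utility."""
    s_adv, s_ben = dot(w, z_adv), dot(w, x_benign)
    # L_def     = -log p(r | adv) + lam * (log p(c | adv) - log p(r | adv))
    # L_utility = -log p(c | benign)
    g_adv = sigmoid(s_adv) + lam   # d L_def / d s_adv
    g_ben = -sigmoid(-s_ben)       # d L_utility / d s_ben
    return [wi - lr * (g_adv * za + g_ben * xb)
            for wi, za, xb in zip(w, z_adv, x_benign)]

def train(w, x_harmful, x_benign, rounds=100):
    for _ in range(rounds):
        p_h, p_t = attack(w, x_harmful)                           # Step I
        w = defense_step(w, pool(p_h, x_harmful, p_t), x_benign)  # Step II
    return w

def attacked(w):
    # Re-run the attack against the given weights and report p(affirm).
    p_h, p_t = attack(w, x_harmful)
    return p_affirm(w, pool(p_h, x_harmful, p_t))

x_harmful, x_benign = [1.0, 0.0], [0.0, 1.0]
w0 = [-1.0, 4.0]   # refuses the clean harmful input, but is brittle under attack
w1 = train(w0, x_harmful, x_benign)

print("p(affirm | attacked harmful) before:", round(attacked(w0), 3))
print("p(affirm | attacked harmful) after: ", round(attacked(w1), 3))
print("p(affirm | benign) after:           ", round(p_affirm(w1, x_benign), 3))
```

In this toy setting the initial weights refuse the clean harmful input yet comply once the noise is added. After the alternating updates, a fresh attack no longer flips the decision and the benign input is still answered. The utility term is what keeps the benign behaviour from collapsing as the defense term pushes toward refusal.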

Paper Structure

This paper contains 24 sections, 5 equations, 16 figures, 6 tables, 1 algorithm.

Figures (16)

  • Figure 1: Illustration of the vulnerability of existing safety-tuning methods compared with our model SafeMLLM. The defender first fine-tunes the original MLLM in step 1. The attackers then attack the fine-tuned MLLMs in step 2 in different ways. In step 3, the fine-tuned MLLMs generate outputs. Details of the experiment settings can be found in Section \ref{sec:exp_setup}.
  • Figure 2: Overview of the proposed SafeMLLM, which contains two iterative steps. In Step I, we fix the parameters of the MLLM. SafeMLLM optimizes two noise matrices, initialized as $\mathbf{P}^h_0$ and $\mathbf{P}^t_0$, over $M$ steps. Step II updates the parameters of the MLLM while keeping the learned $\mathbf{P}^h_M$ and $\mathbf{P}^t_M$ fixed when calculating the defense loss $L_{\rm def}$. To guarantee the utility of the fine-tuned MLLM, we also introduce a utility loss $L_{\rm utility}$. The updated model parameters are then used in Step I again.
  • Figure 3: The utility evaluation of different methods on six MLLMs. The experiment is conducted on 100 samples from the LLaVA-Instruct-80K dataset, and we follow \cite{liu2024visual} to evaluate the quality of responses based on scores generated by gpt-4-turbo.
  • Figure 4: The prompt for generating positive affirmation $c_n$.
  • Figure 5: The prompt for generating negative response $r_n$.
  • ...and 11 more figures