Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
Ziyi Yin, Yuanpu Cao, Han Liu, Ting Wang, Jinghui Chen, Fenglong Ma
TL;DR
This work tackles jailbreak vulnerabilities in multimodal large language models (MLLMs) by introducing SafeMLLM, an adversarial training framework that alternates between two steps. Step I, the Contrastive Embedding Attack (CoE-Attack), inserts perturbations at the token-embedding level, $\boldsymbol{P}_{0}^{h}$ and $\boldsymbol{P}_{0}^{t}$, and optimizes them to maximize $p(\boldsymbol{c}_n|\boldsymbol{P}_{0}^{h},\boldsymbol{x}_n,\boldsymbol{P}_{0}^{t})$ while contrastively suppressing $p(\boldsymbol{r}_n|\boldsymbol{P}_{0}^{h},\boldsymbol{x}_n,\boldsymbol{P}_{0}^{t})$. Step II updates the model by minimizing a defense loss $L_{\rm def}=L_{\rm def}^{\rm target}+\boldsymbol{\text{loss}}_{\rm contra}$ together with a utility loss $L_{\rm utility}$ on benign data. The framework uses LoRA-based cross-modal adapters, keeps the vision encoder fixed, and updates the LLM decoder; it achieves robust defense across six MLLMs and six jailbreak methods while preserving normal multimodal performance. Experimental results show that SafeMLLM outperforms VLGuard and other baselines in reducing attack success rate (ASR) across both image- and text-based attacks while maintaining utility. This approach advances practical safety for MLLMs across varied modalities and threat models, offering a general defense paradigm against sophisticated jailbreak strategies.
Abstract
While multimodal large language models (MLLMs) have achieved remarkable success, their susceptibility to jailbreak attacks has come to light. In such attacks, adversaries exploit carefully crafted prompts to coerce models into generating harmful or undesirable content. Existing defense mechanisms often rely on external inference steps or safety alignment training, both of which become less effective or impractical when facing sophisticated adversarial perturbations in white-box scenarios. To address these challenges and bolster MLLM robustness, we introduce SafeMLLM, an adversarial training framework that alternates between an attack step for generating adversarial noise and a model updating step. At the attack step, SafeMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack), which optimizes token embeddings under a contrastive objective. SafeMLLM then updates model parameters to neutralize the perturbation effects while preserving model utility on benign inputs. We evaluate SafeMLLM across six MLLMs and six jailbreak methods spanning multiple modalities. Experimental results show that SafeMLLM effectively defends against diverse attacks while maintaining robust performance and utility.
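
The alternation between the attack step and the model updating step can be illustrated with a deliberately tiny sketch. It is not the authors' implementation, and nearly everything in it is an assumption made for illustration: the "model" is a three-class linear softmax over a mean-pooled sequence $[\boldsymbol{P}_{0}^{h},\boldsymbol{x}_n,\boldsymbol{P}_{0}^{t}]$ rather than an MLLM; $\boldsymbol{c}_n$ is read as an affirmative response and $\boldsymbol{r}_n$ as a refusal; the contrastive terms use a log-sigmoid of the log-probability gap; the perturbations are clipped to a box of radius `eps`; gradients are taken by finite differences; and all weights are updated instead of LoRA adapters. The names `coe_attack`, `train`, `lam`, and `eps` are hypothetical.

```python
import math

C, R = 0, 1  # toy output classes: 0 = affirmative c, 1 = refusal r, 2 = anything else

def log_probs(W, x, p_h, p_t):
    """Toy stand-in for an MLLM: mean-pool [p_h, x, p_t], linear head, log-softmax."""
    pooled = [(h + xi + t) / 3.0 for h, xi, t in zip(p_h, x, p_t)]
    z = [sum(w * u for w, u in zip(row, pooled)) for row in W]
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def softplus(u):
    """Stable log(1 + exp(u)); note that -log(sigmoid(a)) == softplus(-a)."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def attack_loss(lp, lam=1.0):
    # Assumed form: raise p(c) and contrastively push it above p(r).
    return -lp[C] + lam * softplus(lp[R] - lp[C])

def defense_loss(lp):
    # Assumed form of L_def: target term on the refusal plus a contrastive term.
    return -lp[R] + softplus(lp[C] - lp[R])

def utility_loss(lp):
    # Benign inputs should still be answered.
    return -lp[C]

def num_grad(f, v, h=1e-5):
    """Central finite differences, used here only to keep the sketch short."""
    return [(f(v[:i] + [v[i] + h] + v[i + 1:]) - f(v[:i] + [v[i] - h] + v[i + 1:])) / (2 * h)
            for i in range(len(v))]

def coe_attack(W, x, steps=5, lr=1.0, eps=0.5):
    """Step I: optimize the inserted embeddings p_h, p_t with the model frozen."""
    d = len(x)
    p = [0.0] * (2 * d)  # p_h followed by p_t
    f = lambda q: attack_loss(log_probs(W, x, q[:d], q[d:]))
    for _ in range(steps):
        g = num_grad(f, p)
        # Gradient step on the attack loss, then clip to [-eps, eps] (toy-only bound).
        p = [max(-eps, min(eps, pi - lr * gi)) for pi, gi in zip(p, g)]
    return p[:d], p[d:]

def train(W, harmful, benign, iters=200, lr=0.3):
    """Alternate Step I (attack) and Step II (model update)."""
    d = len(W[0])
    zero = [0.0] * d
    for _ in range(iters):
        advs = [coe_attack(W, x) for x in harmful]  # Step I
        def total(flat):
            Wf = [flat[i:i + d] for i in range(0, len(flat), d)]
            loss = sum(defense_loss(log_probs(Wf, x, ph, pt))
                       for x, (ph, pt) in zip(harmful, advs))
            return loss + sum(utility_loss(log_probs(Wf, x, zero, zero)) for x in benign)
        flat = [w for row in W for w in row]
        flat = [w - lr * g for w, g in zip(flat, num_grad(total, flat))]  # Step II
        W = [flat[i:i + d] for i in range(0, len(flat), d)]
    return W

if __name__ == "__main__":
    W0 = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # initially favors complying
    harmful, benign = [[3.0, 0.0]], [[0.0, 3.0]]
    W1 = train(W0, harmful, benign)
    p_h, p_t = coe_attack(W1, harmful[0])
    print("attacked harmful log-probs:", log_probs(W1, harmful[0], p_h, p_t))
    print("benign log-probs:", log_probs(W1, benign[0], [0.0, 0.0], [0.0, 0.0]))
```

The point of the sketch is the structure, not the numbers: perturbations are re-optimized against the current weights before every update, and the update combines a loss on attacked harmful inputs with a loss on clean benign inputs. In this linear toy the box constraint on the perturbations is what makes a defense possible at all; it should not be read as a property of CoE-Attack.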
