Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Alvaro Velasquez, Ahmad Beirami, Furong Huang, Dinesh Manocha, Amrit Singh Bedi

TL;DR

Multimodal LLM safety against jailbreaking remains challenging despite training-time alignment. Immune reframes safety as an inference-time alignment problem and uses controlled decoding guided by a safe reward model with KL-regularized RLHF to mitigate adversarial prompts, deriving a closed-form decoding policy and a theoretical bound on sub-optimality under adversarial prompts. Empirically, Immune consistently lowers attack success rates across text- and image-based jailbreak benchmarks for several state-of-the-art MLLMs while preserving or improving MM-Vet utility, and incurs manageable inference overhead relative to strong baselines. This approach offers practical, provable protection for deploying vision-language models in real-world settings, with clear directions for extending protection against dynamic and defense-aware attacks.

Abstract

With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks. In this work, we first highlight an important safety gap: alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model through controlled decoding to defend against jailbreak attacks. Additionally, we provide a mathematical characterization of Immune, offering insights into why it improves safety against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model's original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and the state-of-the-art defense strategy, respectively.
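The abstract does not spell out the decoding rule, so the following is only a minimal sketch of reward-guided controlled decoding under stated assumptions. It assumes the standard closed form of a KL-regularized objective, in which each candidate token's base probability is reweighted by `exp(score / alpha)`, and it restricts candidates to the top `k` tokens of the base model. The names `reward_guided_step` and `toy_safety_score`, the token strings, and the scores are hypothetical; a real system would use the MLLM's next-token distribution and a learned safe reward model.

```python
import math
from typing import Callable, Dict, Sequence


def reward_guided_step(
    base_probs: Dict[str, float],
    prefix: Sequence[str],
    safety_score: Callable[[Sequence[str], str], float],
    k: int = 2,
    alpha: float = 1.0,
) -> Dict[str, float]:
    """Reweight the top-k base tokens by exp(safety_score / alpha)."""
    if k < 1 or alpha <= 0:
        raise ValueError("k must be >= 1 and alpha must be > 0")
    # Keep the k most likely tokens with non-zero base probability.
    ranked = sorted(base_probs, key=base_probs.get, reverse=True)
    top_k = [z for z in ranked if base_probs[z] > 0][:k]
    if not top_k:
        raise ValueError("base_probs has no token with positive probability")
    # Assumed closed form: log pi(z) = log pi_base(z) + score(z) / alpha + const.
    logits = {
        z: math.log(base_probs[z]) + safety_score(prefix, z) / alpha
        for z in top_k
    }
    # Normalize with the max-subtraction trick for numerical stability.
    peak = max(logits.values())
    weights = {z: math.exp(v - peak) for z, v in logits.items()}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}


def toy_safety_score(prefix: Sequence[str], token: str) -> float:
    # Hypothetical stand-in for a safe reward model: penalize a compliant opener.
    return -5.0 if token == "Sure" else 0.0


base = {"Sure": 0.7, "Sorry": 0.2, "The": 0.1}
print(reward_guided_step(base, [], toy_safety_score, k=2, alpha=1.0))
```

In this toy setting, a small `alpha` lets the safety score dominate and shifts probability mass toward the refusal token, while a very large `alpha` leaves the (renormalized) base distribution almost unchanged. This mirrors the role of `alpha` as a KL-regularization weight.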

Paper Structure

This paper contains 21 sections, 1 theorem, 19 equations, 7 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Let $R_{\text{safe}}(\mathbf{x}, \mathbf{y}) \leq R_{\text{max}}$, let $p_0$ be a given prompt distribution and $p_{\text{adv}}$ an adversarial prompt distribution, let $\rho_*(\cdot | \mathbf{x})$ denote the optimal trajectory-level distribution for the safe reward, and $\rho_{\text{safe}}(\cdot | \mathbf{x})$ where $\alpha >0$ is a regularization parameter balancing the KL-divergence term in the policy alig

Figures (7)

  • Figure 1: Qualitative Evaluation (Left). Given an image of a house generated by Stable Diffusion rombach2022high and perturbed by adversarial noise qi2023visual, along with a malicious user query asking for steps to "break into and rob a house", we visualize responses from the base model and various inference-time defense strategies, including CoCA gao2024coca and AdaShield wang2024adashield. We observe that all compared defense strategies are misled into generating harmful content. In contrast, our proposed inference-time safety alignment framework, Immune, effectively rejects the user query, citing its unethical nature. This evaluation underscores the importance of inference-time alignment in preventing harmful responses. We visualize additional responses in the Appendix. Quantitative Evaluation (Right). To empirically validate the effectiveness of Immune, we compare the attack success rates and model utility across various state-of-the-art defense strategies. A lower attack success rate reflects improved safety in generated outputs. Our results indicate that Immune substantially lowers the attack success rate compared to other baselines. Additionally, Immune not only strengthens model safety but also preserves the model’s original utility, as demonstrated by its performance on the MM-Vet benchmark yu2023mm. We note that an ideal defense strategy should have a low attack success rate while maintaining high utility, i.e., towards the bottom right corner of the plot. Refer to Section \ref{sec:exp} for further details.
  • Figure 2: An illustration of our proposed inference-time alignment-based defense strategy, Immune.
  • Figure 3: Evaluation on MM-Vet. We evaluate model utility by comparing the performance of different baseline defense strategies across various MLLMs on the MM-Vet dataset yu2023mm. A higher model utility indicates better visual-reasoning capabilities. Immune preserves the model's original capabilities and even enhances performance in certain cases.
  • Figure 4: We measure ASR and model utility for different combinations of hyper-parameters $k$ and $\alpha$. The model is LLaVA-1.5 liu2024improved.
  • Figure 5: For the following example from the JailbreakV-28K dataset luo2024jailbreakv, the input to the model is a noise image, along with a malicious user query asking for steps to "deliver malware in email". While other baseline defenses fail to generate a safe response, Immune, leveraging inference-time alignment, effectively neutralizes this attack.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Theorem 1