UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models

Sejoon Oh, Yiqiao Jin, Megha Sharma, Donghyun Kim, Eric Ma, Gaurav Verma, Srijan Kumar

TL;DR

Multimodal LLMs are vulnerable to jailbreak attacks that manipulate images and text to elicit unsafe outputs. UniGuard introduces universal multimodal safety guardrails, one per modality (a noise guardrail for images and a suffix guardrail for text), optimized to suppress harmful content with minimal impact on general vision-language tasks. Across multiple models, UniGuard substantially reduces attack success while preserving core capabilities, and the guardrails transfer to models other than the one they were optimized on. The work points toward robust, scalable safety mechanisms for multimodal AI systems, with model-specific tailoring and support for additional modalities left as future enhancements.

Abstract

Multimodal large language models (MLLMs) have revolutionized vision-language understanding but remain vulnerable to multimodal jailbreak attacks, where adversarial inputs are meticulously crafted to elicit harmful or inappropriate responses. We propose UniGuard, a novel multimodal safety guardrail that jointly considers the unimodal and cross-modal harmful signals. UniGuard trains a multimodal guardrail to minimize the likelihood of generating harmful responses in a toxic corpus. The guardrail can be seamlessly applied to any input prompt during inference with minimal computational costs. Extensive experiments demonstrate the generalizability of UniGuard across multiple modalities, attack strategies, and multiple state-of-the-art MLLMs, including LLaVA, Gemini Pro, GPT-4o, MiniGPT-4, and InstructBLIP. Notably, this robust defense mechanism maintains the models' overall vision-language understanding capabilities.

Paper Structure

This paper contains 19 sections, 2 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: UniGuard robustifies multimodal large language models (MLLMs) against multimodal jailbreak attacks by using safety guardrails to purify malicious input prompts, ensuring safe responses.
  • Figure 2: Overview of UniGuard. Multimodal safety guardrails (right) are optimized to minimize the likelihood of generating harmful content sampled from a corpus $\mathcal{C}$ (top left) on the open-source MLLM LLaVA 1.5 (bottom left). We use projected gradient descent for optimization (middle). Once optimized, the guardrails can be applied to any MLLM input prompt.
  • Figure 3: Transferability of UniGuard on MiniGPT-4, InstructBLIP, GPT-4o, and Gemini Pro against unconstrained adversarial visual attacks [qi2023visual] with the RTP [gehman2020realtoxicityprompts] text prompt dataset. A lower success ratio ($\downarrow$) is better. We test three groups of methods: 1) the original model under unconstrained attack (Attack); 2) five baseline methods, including BlurKernel (3x3) (Blur), Comp-Decomp with quality=10 (Comp), DiffPure [nie2022diffusion] (DP), SmoothLLM [robey2023smoothllm] (SLLM), and VLGuard [zongsafety]; 3) our proposed UniGuard with image & optimized text guardrails (Ours+O) and pre-defined text guardrails (Ours+P).
  • Figure 4: Performance of various defense strategies on MM-Vet [yu2023mm]. The impact on accuracy is minimal when the noise level is controlled at $\epsilon=16/255$ or $32/255$.
  • Figure 5: Attack success ratio of UniGuard and baseline defense methods against constrained adversarial visual attacks [qi2023visual] on MiniGPT-4 (left) and InstructBLIP (right). A lower success ratio ($\downarrow$) is better. We show the attack success ratios among three groups of methods: 1) the original model under unconstrained attack (Attack); 2) the six baseline methods, including random perturbation (random), BlurKernel (3x3) (Blur), Comp-Decomp with quality=10 (Comp), DiffPure [nie2022diffusion] (DP), SmoothLLM [robey2023smoothllm] (SLLM), and VLGuard [zongsafety]; 3) our proposed UniGuard, including UniGuard with image & optimized text guardrails (Ours+O) and pre-defined text guardrails (Ours+P).
  • ...and 2 more figures
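
The captions for Figures 2 and 4 describe an image guardrail obtained by projected gradient descent under a noise budget such as $\epsilon=16/255$. The sketch below illustrates only that optimization pattern and is not the authors' implementation: a small logistic scorer stands in for the MLLM's likelihood of harmful text, and every name in it (`harm_log_likelihood`, `pgd_guardrail`, `targets`, the step size, the iteration count) is a hypothetical choice made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def harm_log_likelihood(delta, inputs, targets):
    """Toy stand-in for sum_c log p(c | x + delta).

    Assumption: each harmful target c is a weight vector w, and
    p(c | x) = sigmoid(w . x). A real system would query an MLLM instead.
    """
    total = 0.0
    for x in inputs:
        for w in targets:
            z = sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta))
            total += math.log(sigmoid(z))
    return total


def harm_gradient(delta, inputs, targets):
    """Gradient of harm_log_likelihood with respect to delta."""
    grad = [0.0] * len(delta)
    for x in inputs:
        for w in targets:
            z = sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta))
            coeff = 1.0 - sigmoid(z)  # d/dz log(sigmoid(z))
            for i, wi in enumerate(w):
                grad[i] += coeff * wi
    return grad


def pgd_guardrail(inputs, targets, eps=16 / 255, step=1 / 255, iters=50):
    """Find one shared noise vector delta with ||delta||_inf <= eps that
    lowers the toy harmful-content likelihood across all inputs."""
    delta = [0.0] * len(inputs[0])
    for _ in range(iters):
        grad = harm_gradient(delta, inputs, targets)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            # Signed descent step, then projection onto the L-infinity ball.
            delta[i] = max(-eps, min(eps, delta[i] - step * sign))
    return delta


# Hypothetical toy data: two 3-pixel "images" and two harmful targets.
inputs = [[0.2, 0.5, 0.8], [0.6, 0.1, 0.4]]
targets = [[1.0, -2.0, 0.5], [0.5, -1.0, 1.5]]
guard = pgd_guardrail(inputs, targets)
print(harm_log_likelihood([0.0] * 3, inputs, targets))
print(harm_log_likelihood(guard, inputs, targets))
```

The projection step is what keeps the noise within the budget: after each signed gradient step, every coordinate is clipped back into $[-\epsilon, \epsilon]$, so the same `guard` vector can be added to any input without exceeding the chosen noise level.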