UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
Sejoon Oh, Yiqiao Jin, Megha Sharma, Donghyun Kim, Eric Ma, Gaurav Verma, Srijan Kumar
TL;DR
Multimodal LLMs are vulnerable to jailbreak attacks that manipulate visual and textual inputs to elicit unsafe outputs. UniGuard introduces universal multimodal safety guardrails that optimize a separate defense for each modality (image noise guards and text suffix controls) to suppress harmful content with minimal impact on general vision-language tasks. Across multiple models, UniGuard substantially reduces attack success while preserving core capabilities, demonstrating transferability and practical utility for safer deployment of MLLMs. The work points toward robust, scalable safety mechanisms for multimodal AI systems, with model-specific tailoring and support for additional modalities left as future enhancements.
Abstract
Multimodal large language models (MLLMs) have revolutionized vision-language understanding but remain vulnerable to multimodal jailbreak attacks, where adversarial inputs are meticulously crafted to elicit harmful or inappropriate responses. We propose UniGuard, a novel multimodal safety guardrail that jointly considers the unimodal and cross-modal harmful signals. UniGuard trains a multimodal guardrail to minimize the likelihood of generating harmful responses in a toxic corpus. The guardrail can be seamlessly applied to any input prompt during inference with minimal computational costs. Extensive experiments demonstrate the generalizability of UniGuard across multiple modalities, attack strategies, and multiple state-of-the-art MLLMs, including LLaVA, Gemini Pro, GPT-4o, MiniGPT-4, and InstructBLIP. Notably, this robust defense mechanism maintains the models' overall vision-language understanding capabilities.
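To make the training objective concrete, the following toy sketch optimizes a bounded image noise guard that lowers the log-likelihood of a harmful response over a small corpus. It is an illustration under stated assumptions, not the authors' implementation: the logistic "harm scorer" stands in for an MLLM, and the names `harm_log_likelihood` and `train_image_guard`, the L-infinity bound `epsilon`, and the signed-gradient update are all hypothetical choices made for this example. The text suffix guard is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def harm_log_likelihood(weights, bias, image, noise):
    """Log-probability that the stand-in model produces a harmful response.

    Assumption: a logistic scorer replaces the real MLLM likelihood.
    """
    z = bias + sum(w * (x + d) for w, x, d in zip(weights, image, noise))
    return math.log(sigmoid(z))


def sign(value):
    return (value > 0) - (value < 0)


def train_image_guard(weights, bias, images, epsilon=0.1, step=0.01, iters=50):
    """Learn one shared noise pattern that minimizes the mean harmful log-likelihood.

    Assumption: signed gradient descent with clipping to [-epsilon, epsilon].
    Pixel-range clamping of the guarded image is omitted for brevity.
    """
    noise = [0.0] * len(weights)
    for _ in range(iters):
        grad = [0.0] * len(weights)
        for image in images:
            z = bias + sum(w * (x + d) for w, x, d in zip(weights, image, noise))
            scale = 1.0 - sigmoid(z)  # derivative of log(sigmoid(z)) w.r.t. z
            for j, w in enumerate(weights):
                grad[j] += scale * w / len(images)
        # Step against the gradient, then project back into the noise budget.
        noise = [
            max(-epsilon, min(epsilon, d - step * sign(g)))
            for d, g in zip(noise, grad)
        ]
    return noise


def mean_harm(weights, bias, images, noise):
    total = sum(harm_log_likelihood(weights, bias, img, noise) for img in images)
    return total / len(images)


random.seed(0)
dim = 16
weights = [random.uniform(-1.0, 1.0) for _ in range(dim)]
bias = 0.5
images = [[random.uniform(0.2, 0.8) for _ in range(dim)] for _ in range(8)]

guard = train_image_guard(weights, bias, images)
before = mean_harm(weights, bias, images, [0.0] * dim)
after = mean_harm(weights, bias, images, guard)
print("mean harmful log-likelihood without guard:", round(before, 4))
print("mean harmful log-likelihood with guard:", round(after, 4))
```

Because the guard is a single fixed pattern, applying it at inference amounts to adding it to each incoming image, which is consistent with the abstract's claim of minimal computational cost.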
