Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Shuyang Hao; Bryan Hooi; Jun Liu; Kai-Wei Chang; Zi Huang; Yujun Cai

TL;DR

The paper identifies two visual vulnerabilities in Vision-Language Models (VLMs): scenario-aligned images can substantially amplify harmful outputs, and minimizing the attack loss does not reliably produce the most effective adversarial image. It introduces MLAI, a three-stage jailbreak framework that combines scenario-aware image generation, multi-loss adversarial images, and multi-image collaboration to exploit flat regions in the loss landscape and disrupt multimodal alignment. Empirically, MLAI achieves high attack success rates on open-source models (77.75% on MiniGPT-4 and 82.80% on LLaVA-2) and notable transfer to black-box commercial VLMs (up to 60.11%), highlighting persistent safety vulnerabilities. A deduplication defense that detects similar inputs is proposed as a mitigation; it reduces the attack success rate (ASR) by approximately 22.99%, underscoring the need for stronger safeguards in multimodal systems.
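The summary does not say how the deduplication defense detects similar inputs, so the following is only a minimal sketch of one plausible approach, not the paper's implementation. It assumes images arrive as equally sized grayscale pixel grids, uses a simple average hash as the similarity signal, and treats `max_distance` as a hypothetical tuning threshold; the names `average_hash`, `hamming`, and `deduplicate` are illustrative.

```python
from typing import List, Sequence

# Assumption: each image is a small grayscale grid (values 0-255), all the same size.
Image = Sequence[Sequence[int]]


def average_hash(img: Image) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(images: List[Image], max_distance: int = 2) -> List[int]:
    """Return indices of images kept after dropping near-duplicates.

    An image is dropped when its hash is within `max_distance` bits of an
    already-kept image (hypothetical threshold, chosen for illustration).
    """
    kept: List[int] = []
    kept_hashes: List[int] = []
    for i, img in enumerate(images):
        h = average_hash(img)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(i)
            kept_hashes.append(h)
    return kept
```

A filter of this kind would run before the images reach the model, so that several near-identical adversarial images in one request collapse to a single input. A production system would more likely compare learned embeddings than raw pixel hashes.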

Abstract

Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and, contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI's significant impact: it achieves attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2, outperforming existing methods by margins of 34.37% and 12.77%, respectively. Furthermore, MLAI shows considerable transferability to commercial black-box VLMs, achieving a success rate of up to 60.11%. Our work reveals fundamental visual vulnerabilities in current VLM safety mechanisms and underscores the need for stronger defenses. Warning: This paper contains potentially harmful example text.
