Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

Chenhang Cui, Gelei Deng, An Zhang, Jingnan Zheng, Yicong Li, Lianli Gao, Tianwei Zhang, Tat-Seng Chua

TL;DR

The paper demonstrates a novel vulnerability in Large Vision-Language Models where safe-looking images, when combined with safe prompts and iterative reasoning, can jailbreak models to produce unsafe outputs. It introduces Safety Snowball Agent (SSA), a two-stage, agent-based framework that leverages LVLMs' universal reasoning and a safety snowball effect to progressively escalate harm from seemingly benign inputs. Empirical results show SSA achieving high jailbreak success across GPT-4o and open LVLMs on MM-SafetyBench (approximately 88%), while also bypassing common content moderation systems, highlighting a critical safety challenge for multimodal AI systems. The work further analyzes harmfulness levels and neural activation patterns to shed light on why safe inputs can trigger unsafe behavior, informing future defenses and policy developments.

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have showcased strong reasoning abilities across multiple modalities, achieving significant breakthroughs in various real-world applications. Despite this success, the safety guardrails of LVLMs may not cover the unforeseen domains introduced by the visual modality. Existing studies primarily focus on eliciting harmful responses from LVLMs via carefully crafted image-based jailbreaks designed to bypass alignment defenses. In this study, we reveal that a safe image can be exploited to achieve the same jailbreak consequence when combined with additional safe images and prompts. This stems from two fundamental properties of LVLMs: universal reasoning capabilities and the safety snowball effect. Building on these insights, we propose Safety Snowball Agent (SSA), a novel agent-based framework that leverages agents' autonomous and tool-using abilities to jailbreak LVLMs. SSA operates in two principal stages: (1) initial response generation, where tools generate or retrieve jailbreak images based on potential harmful intents, and (2) harmful snowballing, where refined subsequent prompts induce progressively harmful outputs. Our experiments demonstrate that SSA can use nearly any image to induce LVLMs to produce unsafe content, achieving high jailbreak success rates against the latest LVLMs. Unlike prior works that exploit alignment flaws, SSA leverages the inherent properties of LVLMs, presenting a profound challenge for enforcing safety in generative multimodal systems. Our code is available at https://github.com/gzcch/Safety_Snowball_Agent.

Paper Structure

This paper contains 27 sections, 6 equations, 14 figures, 24 tables, 1 algorithm.

Figures (14)

  • Figure 1: An example of harmful content generation on GPT-4o using a seemingly safe image. The text and image on an orange background are generated by SSA. The method for generating such content is detailed in the method section, and more cases can be found in the appendix.
  • Figure 2: Jailbreak success rate comparison of SSA with baseline methods across different LVLMs. SSA demonstrates the highest capability to exploit diverse images for generating harmful content.
  • Figure 3: An example of the safety snowball effect in GPT-4o.
  • Figure 4: Jailbreak success rate of harmful snowballing and direct answer across different LVLMs.
  • Figure 5: Self-evaluation results for harmful snowballing response across different LVLMs.
  • ...and 9 more figures