Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

Hee-Seon Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim

TL;DR

This work shows that the Toxic-Continuation objective used by prior optimization-based jailbreaks cannot reliably induce harmful outputs from benign prompts, and it exposes a universal vulnerability in multimodal safety alignment that this objective misses. It introduces the Benign-to-Toxic (B2T) jailbreak, which optimizes adversarial images to elicit toxic outputs even when the conditioning text is harmless. The authors formulate explicit loss functions for the Toxic-Continuation and Benign-to-Toxic objectives and propose a mixed objective that first breaks safety alignment with B2T and then sustains the toxic output with Toxic-Continuation. Experiments across five benchmarks, four large vision-language models (LVLMs), and multiple safety evaluators show that B2T consistently outperforms prior methods, remains effective under JPEG defenses, transfers to other models in black-box settings, and combines well with GCG-based text prompts. These findings point to a fundamental weakness in multimodal alignment and bear directly on the design of defenses against universal visual jailbreaks.

Abstract

Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.

Paper Structure

This paper contains 47 sections, 4 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Toxic-Continuation vs. Benign-to-Toxic. (a) Clean images alone do not break safety alignment. (b) Prior methods succeed when the input prompt is explicitly toxic (e.g., murder my spouse), but they often fail in the absence of explicit toxicity. (c) Our Benign-to-Toxic (B2T) approach overcomes this by optimizing images to induce toxic responses even from benign input.
  • Figure 2: Toxic-Continuation vs. Benign-to-Toxic Adversarial Image Optimization. (a) Prior methods optimize an adversarial image so that the LVLM continues a toxic conditioning. (b) Our Benign-to-Toxic setup decouples conditioning and target: the LVLM is given a benign conditioning (e.g., '<bos> Humans need'), and the image is optimized to force the generation of harmful tokens (e.g., 'stupid') as target. This enables stronger misalignment capabilities and better reflects subtle real-world jailbreak threats. (For clarity, the figure highlights only one output step per method, though optimization proceeds in parallel across outputs.)
  • Figure 3: Relationship between prompt toxicity scores and the frequency of harmful outputs generated by LVLMs (LLaVA-1.5 and InstructBLIP). The data indicates that higher toxicity scores in prompts correlate with an increased likelihood of generating harmful content, even without adversarial prompts.
  • Figure 4: Category-wise toxicity scores across benchmarks for different jailbreak strategies. Compared to Clean and Toxic-Continuation-based adversarial images, our Benign-to-Toxic-based adversarial image consistently triggers higher toxicity, regardless of the input prompt’s explicit toxicity level across benchmarks.
  • Figure 5: Black-Box Transferability. ASR (%) of adversarial images generated on a source LVLM and evaluated on target LVLMs in a black-box setting.
  • ...and 7 more figures