Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
Hee-Seon Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim
TL;DR
This work exposes a vulnerability in multimodal safety alignment. It first shows that the standard Toxic-Continuation objective, which optimizes an adversarial image to continue an already-toxic prompt, cannot reliably induce harm when the conditioning is benign. It then introduces the Benign-to-Toxic (B2T) jailbreak, which optimizes adversarial images to elicit toxic outputs from harmless conditioning, so the image alone must break the model's safety mechanisms. The authors formulate explicit loss functions for the Toxic-Continuation and Benign-to-Toxic objectives and propose a mixed objective in which B2T breaks safety alignment and Toxic-Continuation then sustains the toxic output, yielding strong universal jailbreak performance. Experiments across five benchmarks, four LVLMs, and multiple safety evaluators show that B2T consistently outperforms prior methods, remains robust under JPEG defenses, transfers to black-box models, and combines well with GCG-based text prompts. These findings point to a fundamental weakness in multimodal alignment and have implications for designing defenses against universal visual jailbreaks.
Abstract
Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that while the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, it struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: the Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.
