RedDiffuser: Red Teaming Vision-Language Models for Toxic Continuation via Reinforced Stable Diffusion

Ruofan Wang, Xiang Zheng, Xiaosen Wang, Cong Wang, Xingjun Ma, Yu-Gang Jiang

TL;DR

This paper identifies a cross-modal vulnerability in vision-language models (VLMs): natural-looking images can induce toxic continuations when paired with a malicious prefix. It introduces RedDiffuser, a two-phase framework that first uses a red-team LLM to greedily search for image prompts and then fine-tunes a diffusion model with reinforcement learning under a dual reward that promotes toxicity while preserving semantic coherence. The approach raises toxicity rates on LLaVA and transfers to Gemini and LLaMA-Vision, even under external guardrails, indicating that safety gaps persist in multimodal alignment. The work highlights the need for defenses that jointly consider image semantics and continuation safety; the authors state that they will release the codebase to support further multimodal red-teaming research.

Abstract

Vision-Language Models (VLMs) are vulnerable to jailbreak attacks, where adversaries bypass safety mechanisms to elicit harmful outputs. In this work, we examine an insidious variant of this threat: toxic continuation. Unlike standard jailbreaks that rely solely on malicious instructions, toxic continuation arises when the model is given a malicious input alongside a partial toxic output, resulting in harmful completions. This vulnerability poses a unique challenge in multimodal settings, where even subtle image variations can disproportionately affect the model's response. To this end, we propose RedDiffuser (RedDiff), the first red teaming framework that uses reinforcement learning to fine-tune diffusion models into generating natural-looking adversarial images that induce toxic continuations. RedDiffuser integrates a greedy search procedure for selecting candidate image prompts with reinforcement fine-tuning that jointly promotes toxic output and semantic coherence. Experiments demonstrate that RedDiffuser significantly increases the toxicity rate in LLaVA outputs by 10.69% and 8.91% on the original and hold-out sets, respectively. It also exhibits strong transferability, increasing toxicity rates on Gemini by 5.1% and on LLaMA-Vision by 26.83%. These findings uncover a cross-modal toxicity amplification vulnerability in current VLM alignment, highlighting the need for robust multimodal red teaming. We will release the RedDiffuser codebase to support future research.
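The abstract does not spell out the greedy search procedure, so the following is only a minimal sketch of what a greedy prompt search could look like, not the authors' algorithm. The names `greedy_prompt_search`, `propose`, `score`, and `rounds` are hypothetical; in the paper's setting a red-team LLM would presumably supply the candidates and the score would come from the downstream pipeline. A neutral toy scorer stands in here so the example runs on its own.

```python
from typing import Callable, List, Tuple


def greedy_prompt_search(
    seed_prompt: str,
    propose: Callable[[str], List[str]],  # hypothetical candidate generator
    score: Callable[[str], float],        # hypothetical scoring function
    rounds: int = 3,
) -> Tuple[str, float]:
    """Keep the best-scoring prompt seen so far, refining it round by round."""
    best_prompt, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        improved = False
        # Candidates are proposed once per round from the current best prompt.
        for candidate in propose(best_prompt):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break  # no candidate beat the incumbent, so stop early
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration only).
def toy_propose(prompt: str) -> List[str]:
    return [prompt + " at dusk", prompt + " in heavy fog"]


def toy_score(prompt: str) -> float:
    # Counts how many target words appear in the prompt.
    return float(len({"dusk", "fog"} & set(prompt.split())))


result = greedy_prompt_search("an empty street", toy_propose, toy_score)
print(result)
```

The search is greedy in the sense that it commits to the best candidate found so far and never revisits discarded ones, which keeps the number of scoring calls small at the cost of possibly missing better prompts.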

Paper Structure

This paper contains 19 sections, 9 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Example of toxic continuation on LLaVA. RedDiffuser generates images capable of inducing harmful continuations, whereas images from a standard diffusion model lead to benign VLM output.
  • Figure 2: RedDiffuser overview. Given an incomplete toxic sentence, Gemini selects an image prompt via greedy search. A diffusion model generates an image, which is passed to a VLM (LLaVA) to produce a continuation. Toxicity and alignment scores from Detoxify and BERTScore are used as rewards to fine-tune the diffusion model.
  • Figure 3: Comparison of images generated by general-purpose Stable Diffusion (left) and RedDiffuser (right). Given the same image prompts, RedDiffuser produces images that subtly elicit discomfort or unease, making them more likely to trigger harmful continuations in VLMs.
  • Figure 4: Visual comparison across diffusion variants. Stable Diffusion outputs relatively neutral imagery; RedDiff (Base) produces stronger semantic cues that influence continuations. RedDiff (Guard) adjusts visual subtlety to consistently pass external safety checkers while retaining adversarial intent.
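
The Figure 2 caption says that a toxicity score and an alignment score are used together as rewards, but this summary does not give the formula. The sketch below assumes a simple weighted sum; the function `dual_reward`, the parameter `alignment_weight`, and the choice of which two texts the alignment scorer compares are all assumptions, not the paper's definition. Both scorers are passed in as callables, and a token-overlap function is used as a toy stand-in for the alignment score.

```python
from typing import Callable


def dual_reward(
    continuation: str,
    reference: str,
    toxicity_fn: Callable[[str], float],        # e.g. a classifier score in [0, 1]
    alignment_fn: Callable[[str, str], float],  # e.g. a similarity score in [0, 1]
    alignment_weight: float = 0.5,              # hypothetical trade-off weight
) -> float:
    """Combine the two scores into one scalar reward (assumed weighted sum)."""
    toxicity = toxicity_fn(continuation)
    alignment = alignment_fn(continuation, reference)
    return toxicity + alignment_weight * alignment


def jaccard_overlap(a: str, b: str) -> float:
    """Toy alignment stand-in: token-set overlap between two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Toy usage with a constant toxicity score so the example is self-contained.
reward = dual_reward("a b c", "a b d", lambda text: 0.25, jaccard_overlap)
print(reward)
```

Keeping the scorers pluggable means the combination rule can be tested without loading any model; in a real pipeline the scalar returned here would serve as the reward signal for the reinforcement fine-tuning step.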