Table of Contents
Fetching ...

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang

TL;DR

This work addresses the vulnerability of vision-language models to multimodal jailbreak attacks under black-box conditions. It introduces BlueSuffix, a blue-team defense that combines a diffusion-based image purifier, an LLM-based text purifier, and a reinforcement-tuned blue-team suffix generator to enhance cross-modal robustness without harming benign performance. The method demonstrates substantial reductions in attack success rates across open-source and commercial VLMs, including resilience against adaptive attacks and transferability to unseen datasets, highlighting the practicality and effectiveness of blue-teaming in securing VLMs. The results suggest that leveraging cross-modal optimization and modular purifiers can significantly strengthen VLM safety in real-world deployments while maintaining user experience on benign prompts.

Abstract

In this paper, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method BlueSuffix that defends target VLMs against jailbreak attacks without compromising its performance under black-box setting. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator using reinforcement fine-tuning for enhancing cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructionBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks. Code is available at https://github.com/Vinsonzyh/BlueSuffix.

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

TL;DR

This work addresses the vulnerability of vision-language models to multimodal jailbreak attacks under black-box conditions. It introduces BlueSuffix, a blue-team defense that combines a diffusion-based image purifier, an LLM-based text purifier, and a reinforcement-tuned blue-team suffix generator to enhance cross-modal robustness without harming benign performance. The method demonstrates substantial reductions in attack success rates across open-source and commercial VLMs, including resilience against adaptive attacks and transferability to unseen datasets, highlighting the practicality and effectiveness of blue-teaming in securing VLMs. The results suggest that leveraging cross-modal optimization and modular purifiers can significantly strengthen VLM safety in real-world deployments while maintaining user experience on benign prompts.

Abstract

In this paper, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method BlueSuffix that defends target VLMs against jailbreak attacks without compromising its performance under black-box setting. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator using reinforcement fine-tuning for enhancing cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructionBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks. Code is available at https://github.com/Vinsonzyh/BlueSuffix.

Paper Structure

This paper contains 39 sections, 7 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: An illustration of our BlueSuffix defense. A pair of image-text jailbreak prompts (left) can compromise the target VLM to output harmful content (top right). However, the purified and suffixed prompts by our BlueSuffix (middle) lose their adversarial property (bottom right).
  • Figure 2: An overview of BlueSuffix and its three key components: 1) an image purifier, 2) an LLM-based text purifier, and 3) a lightweight LLM-based (e.g., GPT-2) blue-team suffix generator. The suffix generator is trained to maximize the expected safety score given by an LLM-based judge.
  • Figure 3: Component ablation of BlueSuffix.
  • Figure 4: An example of jailbreaking textual prompt purified by GPT-4o and Llama-3-8B-Instruct.
  • Figure 5: The LLM-based rewrite template.
  • ...and 5 more figures