The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense

Yangyang Guo, Fangkai Jiao, Liqiang Nie, Mohan Kankanhalli

TL;DR

The paper investigates why Vision Large Language Models (VLLMs) remain vulnerable to jailbreak attacks while defenses against those attacks reach near-saturation performance on benchmarks. It argues that vision inputs undermine safety alignment and that defenses suffer from over-prudence and evaluation inconsistencies. It then proposes LLM-Pipeline, a simple approach that runs a detector before the VLLM responds, to balance safety and usefulness. Key contributions include identifying vision-induced jailbreak susceptibility, diagnosing over-prudence, and demonstrating a vision-free detector approach that complements VLLMs. The work emphasizes rethinking benchmarks, defense strategies, and evaluation protocols to advance trustworthy VLLMs in real-world settings.

Abstract

The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks comes as no surprise. However, recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations, often with minimal effort. This dual high performance in both attack and defense raises a fundamental and perplexing paradox. To gain a deep understanding of this issue and thus help strengthen the trustworthiness of VLLMs, this paper makes three key contributions: i) A tentative explanation for why VLLMs are prone to jailbreak attacks, namely the inclusion of vision inputs, together with an in-depth analysis. ii) The recognition of a largely ignored problem in existing defense mechanisms: over-prudence. This problem causes these defense methods to abstain unintentionally, even on benign inputs, thereby undermining their reliability in faithfully defending against attacks. iii) A simple safety-aware method: LLM-Pipeline. Our method repurposes the more advanced guardrails of off-the-shelf LLMs, which serve as an effective alternative detector prior to the VLLM response. Last but not least, we find that the two representative evaluation methods for jailbreak often agree only at chance level. This limitation makes them potentially misleading when evaluating attack strategies or defense mechanisms. We believe the findings from this paper offer useful insights to rethink the foundational development of VLLM safety with respect to benchmark datasets, defense strategies, and evaluation methods.
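
The detector-before-response idea behind LLM-Pipeline can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`llm_pipeline`, `toy_detector`, `toy_vllm`), the refusal string, and the keyword-based detector are all hypothetical stand-ins. The abstract does not specify how image content reaches the detector, so this sketch assumes the detector sees only the textual instruction.

```python
from typing import Callable

# Hypothetical refusal message; the paper does not prescribe one.
REFUSAL = "I'm sorry, but I can't help with that request."


def llm_pipeline(
    instruction: str,
    image: bytes,
    detector: Callable[[str], bool],
    vllm: Callable[[str, bytes], str],
    refusal: str = REFUSAL,
) -> str:
    """Run a text-only safety check before the VLLM is allowed to answer.

    `detector` returns True when it judges the instruction unsafe.
    `vllm` is only called when the detector lets the request through.
    """
    if detector(instruction):
        return refusal
    return vllm(instruction, image)


def toy_detector(instruction: str) -> bool:
    """Stand-in for an LLM guardrail (assumption: simple phrase matching)."""
    blocked_phrases = ("build a bomb", "steal a car")
    text = instruction.lower()
    return any(phrase in text for phrase in blocked_phrases)


def toy_vllm(instruction: str, image: bytes) -> str:
    """Stand-in for a real vision-language model call."""
    return f"[answer to: {instruction}]"


if __name__ == "__main__":
    print(llm_pipeline("How do I build a bomb?", b"", toy_detector, toy_vllm))
    print(llm_pipeline("Describe this picture.", b"", toy_detector, toy_vllm))
```

In practice the toy detector would be replaced by a call to a safety-aligned LLM, and the over-prudence concern raised in the abstract suggests checking the pipeline on benign inputs as well as harmful ones.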

Paper Structure

This paper contains 17 sections, 1 equation, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Safety attributes of textual Instruction and visual Image compositions in VLLM inputs. Level of harmfulness ranked across three quadrants: II < IV < III.
  • Figure 2: Examples of harmful captions generated by the Qwen2-VL model in response to benign caption prompts. Top: Hateful speech against specific religions; Bottom: Harmful racially biased history. More contentious cases, such as those involving sensitive political issues, are shown in the supplementary material.
  • Figure 3: t-SNE visualization of features from unsafe (U) and safe (S) instructions (the safe points are overlaid by unsafe ones for figures 3, 6, and 9). Unlike the other two text-only models, VLLM-MM processes both textual instructions and images. The safety alignment inherent in the original LLM-Base is maintained in VLLM-Text, but is significantly compromised in VLLM-MM.
  • Figure 4: Image attention statistics from the [CLS] token of LLaVA. (a) For benign instructions, VLLMs pay more attention to unsafe images compared to safe images. (b) For the same images, the distribution of attention weights remains almost the same across instructions with distinct safety attributes.
  • Figure 5: Model abstention ratio of VLGuard methods for safe image + caption instructions (top) and safe instructions only (bottom).
  • ...and 5 more figures

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3