Retention Score: Quantifying Jailbreak Risks for Vision Language Models

Zaitang Li, Pin-Yu Chen, Tsung-Yi Ho

TL;DR

The paper tackles jailbreak risks in Vision-Language Models by introducing Retention Score, a multi-modal, margin-based robustness certificate that certifies resistance to toxicity-inducing perturbations. It defines two components, Retention-I for image perturbations and Retention-T for text perturbations, computed via diffusion-based data generation and a toxicity judgment framework, and proves their validity as robustness certificates under $\ell_2$ perturbations. Empirically, Retention Score consistently ranks VLM robustness across open-source models (e.g., MiniGPT-4, InstructBLIP, LLaVA) and black-box APIs (GPT-4V, Gemini Pro Vision), while offering substantial run-time improvements (up to $\sim 30\times$) over traditional jailbreak attacks. The work reveals that incorporating visual modalities can amplify jailbreak risks and provides a practical, scalable metric for safety evaluation and model-card-style reporting. Overall, Retention Score enables efficient, attack-agnostic robustness assessment of VLMs with broad applicability to both research and deployment contexts. $R_I$ and $R_T$ serve as principled bounds that quantify resilience against adversarial toxicity in multimodal models.

Abstract

The emergence of Vision-Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. However, this progress has also made VLMs vulnerable to sophisticated adversarial attacks, raising concerns about their reliability. The objective of this paper is to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score. Retention Score is a multi-modal evaluation metric that includes Retention-I and Retention-T scores for quantifying jailbreak risks in the visual and textual components of VLMs. Our process involves generating synthetic image-text pairs using a conditional diffusion model. A VLM then responds to each pair, and a toxicity judgment classifier assigns a toxicity score to each response. By calculating the margin in toxicity scores, we can quantify the robustness of the VLM in an attack-agnostic manner. Our work has four main contributions. First, we prove that Retention Score can serve as a certified robustness metric. Second, we demonstrate that most VLMs with visual components are less robust against jailbreak attacks than the corresponding plain VLMs. Third, we evaluate black-box VLM APIs and find that the security settings in Google Gemini significantly affect the score and robustness, and that the robustness of GPT-4V is similar to that of Gemini's medium setting. Fourth, our approach offers a time-efficient alternative to existing adversarial attack methods and provides consistent model robustness rankings when evaluated on VLMs including MiniGPT-4, InstructBLIP, and LLaVA.
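
The margin idea in the abstract can be illustrated with a minimal sketch. It assumes the score is the average, over all generated samples and prompts, of the non-toxic score minus the toxic score, clipped at zero. The exact definitions of Retention-I and Retention-T (including any normalization constant) are given in the paper's equations and are not reproduced here, so the function name `retention_score`, the `scale` parameter, and the toy scorers below are all hypothetical.

```python
from typing import Callable, Sequence


def retention_score(
    samples: Sequence[object],
    prompts: Sequence[str],
    non_toxic_score: Callable[[object, str], float],
    toxic_score: Callable[[object, str], float],
    scale: float = 1.0,
) -> float:
    """Average clipped toxicity margin over all (sample, prompt) pairs.

    Assumption: `samples` stands in for the generated inputs (for example,
    diffusion-generated images), and the two scorers wrap a VLM plus a
    toxicity judge. `scale` is a placeholder for whatever normalization
    the paper applies; it is not taken from the source.
    """
    if not samples or not prompts:
        raise ValueError("need at least one sample and one prompt")

    total = 0.0
    for sample in samples:
        for prompt in prompts:
            margin = non_toxic_score(sample, prompt) - toxic_score(sample, prompt)
            # Pairs the judge already rates as toxic contribute zero.
            total += scale * max(margin, 0.0)
    return total / (len(samples) * len(prompts))


# Toy usage with made-up scores; a real pipeline would query a VLM and a judge.
toy_non_toxic = {"image_a": 0.9, "image_b": 0.2}


def toy_nt(sample: object, prompt: str) -> float:
    return toy_non_toxic[str(sample)]


def toy_t(sample: object, prompt: str) -> float:
    return 1.0 - toy_nt(sample, prompt)


print(retention_score(["image_a", "image_b"], ["describe this"], toy_nt, toy_t))
```

Clipping at zero means a sample that is already judged toxic adds nothing to the score, so under this sketch a higher average corresponds to a larger safety margin, in line with the statement that a higher score means lower jailbreak risk.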

Paper Structure

This paper contains 38 sections, 7 theorems, 20 equations, 6 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

For a given image $I$ and a collection of text prompts $\mathbb{X}$, consider Retention Image Score $R_I$ defined in eq:retention_image and Retention Text Score $R_T$ defined in eq:retention_text. For each $T \in \mathbb{X}$ satisfying the condition $M_{nt}(I, T) \geq M_{t}(I, T)$, indicating a non- (II) Similarly, for perturbations within the semantic space of $T$, the worst-case non-toxic score

Figures (6)

  • Figure 1: (a) An adversarial image optimized on a harmful corpus can jailbreak a VLM (Qi et al. 2023a). The model will not refuse to generate harmful responses. (b) Flow chart of calculating Retention-Image and Retention-Text scores for VLMs. Given some evaluation samples, we first use diffusion generators to create semantically similar synthetic samples. Then, we pass the generated samples into a VLM to get responses and further use a toxicity judgment model (e.g., Perspective API or an LLM like Llama-70B (Touvron et al. 2023)) for toxicity level predictions. Finally, we use these statistics to compute the Retention Score as detailed in Section 3.2. (c) Consistency of Attack Success Rate (ASR) using the attack in (Qi et al. 2023a) and Retention Score. A higher score means lower jailbreak risks (a lower ASR is expected).
  • Figure 2: Run-time improvement (Retention Score over Visual and Text attacks).
  • Figure 3: Generated Images for old subgroup.
  • Figure 4: Generated Images for young subgroup.
  • Figure 5: Generated Images for female subgroup.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Theorem 1: Robustness Certification via Retention Score
  • Remark 1
  • Lemma 1: Lipschitz Continuity in Gradient Form for VLM in image aspect (Paulavičius and Žilinskas 2006)
  • Lemma 2: Formal guarantee on lower bound of VLM for adversarial image attacks.
  • Lemma 3: Stein's lemma (Stein 1981)
  • Lemma 4: Proof of global Lipschitz constant
  • Lemma 5: Lipschitz Continuity in Gradient Form for VLM in text aspect (Paulavičius and Žilinskas 2006)
  • Lemma 6: Text-Based robustness guarantee.