JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation

Md Jueal Mia, M. Hadi Amini

TL;DR

This work addresses the vulnerability of vision-language models (VLMs) to image-space jailbreaks by proposing JaiLIP, a loss-guided perturbation framework that minimizes a joint objective of perceptual distortion and a harmful-output loss, using a tanh-based reparameterization to bound pixel changes. The method operates entirely in image space, does not rely on text prompts, and optimizes a total loss $L_{total} = L_{MSE} + c \cdot L_{model}$, computed over a batch of toxic targets, where $c$ weights the model loss against the distortion term. Empirical evaluations on BLIP-2 and MiniGPT-4, with toxicity metrics from Perspective API and Detoxify, show JaiLIP achieving higher toxicity and attack success than PGD baselines while remaining visually similar to the original input; a transportation-domain experiment further demonstrates cross-domain applicability. The results underscore the need for stronger safety defenses in VLMs and highlight practical considerations for defending multimodal systems against image-based jailbreak attacks.
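The objective described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the tanh mapping to the $[0, 1]$ pixel range follows the common Carlini-Wagner-style reparameterization, the function names (`to_image`, `to_latent`, `total_loss`, `optimize`) are hypothetical, and `make_toy_model_loss` is a toy quadratic stand-in for the VLM's loss on toxic targets, which in practice would come from the model itself.

```python
import numpy as np

def to_image(w):
    """Map an unconstrained variable w to pixel values in [0, 1] via tanh."""
    return 0.5 * (np.tanh(w) + 1.0)

def to_latent(x, eps=1e-6):
    """Inverse of to_image, used to initialise w from the clean image."""
    x = np.clip(x, eps, 1.0 - eps)
    return np.arctanh(2.0 * x - 1.0)

def total_loss(w, x_clean, model_loss, c):
    """L_total = L_MSE + c * L_model, evaluated at the image encoded by w."""
    x_adv = to_image(w)
    l_mse = np.mean((x_adv - x_clean) ** 2)
    return l_mse + c * model_loss(x_adv)

def make_toy_model_loss(target):
    """Hypothetical stand-in for the VLM loss: squared distance to a target.

    Returns the loss function and its gradient with respect to the image.
    """
    def loss(x):
        return np.mean((x - target) ** 2)
    def grad(x):
        return 2.0 * (x - target) / x.size
    return loss, grad

def optimize(x_clean, model_loss_grad, c=1.0, lr=5.0, steps=100):
    """Plain gradient descent on w; pixels stay in [0, 1] by construction."""
    w = to_latent(x_clean)
    for _ in range(steps):
        x_adv = to_image(w)
        # Gradient of L_MSE + c * L_model with respect to the image.
        grad_x = 2.0 * (x_adv - x_clean) / x_adv.size + c * model_loss_grad(x_adv)
        # Chain rule through x_adv = 0.5 * (tanh(w) + 1).
        grad_w = grad_x * 0.5 * (1.0 - np.tanh(w) ** 2)
        w = w - lr * grad_w
    return to_image(w)
```

Because the optimization variable is `w` rather than the pixels themselves, no clipping or projection step is needed after each update. In this sketch, a larger `c` shifts the trade-off toward the model loss at the cost of a larger MSE from the clean image.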

Abstract

Vision-Language Models (VLMs) have shown remarkable abilities in multimodal reasoning tasks. However, concerns about their potential misuse and safety alignment have grown significantly as the range of attack vectors has widened. Among these attack vectors, recent studies have demonstrated that image-based perturbations are particularly effective at eliciting harmful outputs. Many existing techniques for jailbreaking VLMs suffer from unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between the clean and adversarial images with the model's harmful-output loss. We evaluate our proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods in producing toxicity. Moreover, we evaluate our method in the transportation domain to demonstrate the attack's practicality in a specific application domain, beyond toxic text generation. Our findings emphasize the practical challenges of image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.

Paper Structure

This paper contains 8 sections, 2 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the proposed JaiLIP framework.
  • Figure 2: Visualization of clean and adversarial examples generated under different perturbation settings using BLIP-2. (a) Clean image, (b) Constrained (16/255), (c) Constrained (32/255), (d) Constrained (64/255), and (e) JaiLIP.
  • Figure 3: Visualization of clean and adversarial examples generated by the proposed JaiLIP method using two different VLMs. (a) Clean image without attack, (b) adversarial image optimized using BLIP-2, and (c) adversarial image optimized using MiniGPT-4.
  • Figure 4: Sample prompt with clean and JaiLIP outputs, Model: BLIP-2.