Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models

Tiejin Chen, Kaishen Wang, Hua Wei

TL;DR

Zer0-Jack tackles the problem of jailbreaking black-box multi-modal LLMs by applying zeroth-order optimization to craft malicious image prompts. It introduces a patch-wise block coordinate descent strategy to reduce gradient estimation variance and memory usage, enabling direct attacks on billion-parameter MLLMs without access to internal parameters. The method achieves high attack success rates across models such as MiniGPT-4, LLaVA1.5, INF-MLLM1, and GPT-4o, while significantly reducing memory demands compared to white-box approaches. This work highlights vulnerabilities in current safety alignments for MLLMs and emphasizes the need for robust defenses, safer API designs, and post-hoc evaluation to mitigate such black-box jailbreak risks.

Abstract

Jailbreaking methods, which induce Multi-modal Large Language Models (MLLMs) to output harmful responses, raise significant safety concerns. Among these methods, gradient-based approaches, which use gradients to generate malicious prompts, have been widely studied due to their high success rates in white-box settings, where full access to the model is available. However, these methods have notable limitations: they require white-box access, which is not always feasible, and they involve high memory usage. When white-box access is unavailable, attackers often resort to transfer attacks, in which malicious inputs generated on white-box models are applied to black-box models, but this typically reduces attack performance. To overcome these challenges, we propose Zer0-Jack, a method that bypasses the need for white-box access by leveraging zeroth-order optimization. We further propose patch coordinate descent, which efficiently generates malicious image inputs for directly attacking black-box MLLMs and significantly reduces memory usage. Through extensive experiments, Zer0-Jack achieves a high attack success rate across various models, surpassing previous transfer-based methods and performing comparably with existing white-box jailbreak techniques. Notably, Zer0-Jack achieves a 95% attack success rate on MiniGPT-4 with the Harmful Behaviors Multi-modal Dataset in a black-box setting, demonstrating its effectiveness. Additionally, we show that Zer0-Jack can directly attack commercial MLLMs such as GPT-4o. Code is provided in the supplement.
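
The abstract names two generic optimization ideas: zeroth-order optimization (estimating a gradient from function values only) and patch coordinate descent (updating one block of coordinates at a time). The sketch below illustrates both on a toy quadratic objective. It is not the authors' implementation: the two-point finite-difference estimator, the objective `toy_loss`, the helper names, and all parameter values (`mu`, `lr`, `patch_size`, `sweeps`) are assumptions chosen for illustration, since this summary does not give the paper's estimator or hyperparameters.

```python
import random


def toy_loss(x, target):
    """Toy quadratic objective, queried only through its output value."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def estimate_patch_gradient(loss_fn, x, patch, mu, rng):
    """Two-point zeroth-order gradient estimate restricted to one patch.

    Only the coordinates listed in `patch` are perturbed, so the estimate
    covers a small block rather than the full input (assumed estimator).
    """
    u = [0.0] * len(x)
    for i in patch:
        u[i] = rng.gauss(0.0, 1.0)  # random direction inside the patch
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    # Directional slope from two function evaluations, no backpropagation.
    slope = (loss_fn(x_plus) - loss_fn(x_minus)) / (2.0 * mu)
    return [slope * u[i] for i in patch]


def patch_coordinate_descent(loss_fn, x0, patch_size, sweeps,
                             lr=0.05, mu=1e-3, seed=0):
    """Cycle over contiguous patches, updating one patch per step."""
    rng = random.Random(seed)
    x = list(x0)
    patches = [list(range(start, min(start + patch_size, len(x))))
               for start in range(0, len(x), patch_size)]
    for _ in range(sweeps):
        for patch in patches:
            grad = estimate_patch_gradient(loss_fn, x, patch, mu, rng)
            for i, g in zip(patch, grad):
                x[i] -= lr * g
    return x


target = [0.1 * i for i in range(16)]
loss_fn = lambda x: toy_loss(x, target)
x_init = [0.0] * 16
x_final = patch_coordinate_descent(loss_fn, x_init, patch_size=4, sweeps=300)
initial_loss = loss_fn(x_init)
final_loss = loss_fn(x_final)
print(initial_loss, final_loss)
```

Each step needs only two evaluations of the objective and stores one perturbation vector, which is the general reason function-value-only methods use far less memory than backpropagation. Restricting the perturbation to a small patch keeps the dimension of each random direction low, which is the usual way to limit the variance of this kind of estimate.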

Paper Structure

This paper contains 33 sections, 10 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: The overview of Zer0-Jack. To effectively attack a black-box MLLM, Zer0-Jack leverages zeroth-order optimization and patch coordinate descent.
  • Figure 2: Ablation studies on different components.
  • Figure 3: Influence of patch size on two datasets.
  • Figure 4: Case study illustrating the jailbreak performance of text-based and image-based methods on LLaVA1.5 for the same question with the corresponding image. The first row shows the responses generated by the text-based methods AutoDAN, GCG, and PAIR, together with the text prompts optimized by the white-box methods. The second row compares the responses obtained with P-Image, A-Image, and the optimized image from Zer0-Jack, each paired with the text input.
  • Figure 5: Comparison of average memory cost and iteration efficiency when optimizing a sample on MiniGPT-4. The bar chart represents memory consumption (in GB), while the line graph illustrates iteration efficiency (number of iterations).
  • ...and 4 more figures