Can Large Language Models Automatically Jailbreak GPT-4V?

Yuanwei Wu; Yue Huang; Yixin Liu; Xiang Li; Pan Zhou; Lichao Sun

TL;DR

The paper investigates vulnerabilities in GPT-4V related to facial recognition by introducing AutoJailbreak, an automated jailbreak framework that uses LLMs to optimize prompts. It combines weak-to-strong prompting, suffix-based attack enhancements, and an efficient hypothesis-testing-driven search to achieve an Attack Success Rate exceeding 95.3% in black-box settings. Key contributions include a three-stage AutoJailbreak method, empirical evidence of GPT-4V’s susceptibility across celebrity datasets, and semantic analyses of jailbreak prompts and adversarial text. The work underscores the need for stronger safety and privacy protections in multimodal LLMs and motivates future defenses beyond human-tuned prompts and standard moderation.

Abstract

GPT-4V has attracted considerable attention due to its extraordinary capacity for integrating and processing multimodal information. At the same time, its face recognition capability raises new safety concerns about privacy leakage. Despite researchers' efforts in safety alignment through RLHF or preprocessing filters, vulnerabilities might still be exploited. In our study, we introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization. We leverage Large Language Models (LLMs) for red-teaming to refine the jailbreak prompt and employ weak-to-strong in-context learning prompts to boost efficiency. Furthermore, we present an effective search method that incorporates early stopping to minimize optimization time and token expenditure. Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3%. This research sheds light on how to strengthen GPT-4V security and underscores the potential for LLMs to be exploited to compromise GPT-4V integrity.
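The abstract mentions a search that stops early to save evaluation time and tokens, but this page does not give the procedure. The sketch below is only an illustration of the general idea, not the authors' algorithm: it assumes each candidate is judged on a stream of boolean outcomes and abandons the candidate once an exact one-sided binomial test rejects the hypothesis that its success rate reaches a target. The names `evaluate_with_early_stop`, `target_rate`, `alpha`, and `min_trials` are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def evaluate_with_early_stop(
    outcomes: Iterable[bool],
    target_rate: float = 0.8,   # assumed target success rate (hypothetical)
    alpha: float = 0.05,        # assumed significance level (hypothetical)
    min_trials: int = 5,        # do not test before this many trials
) -> Tuple[float, int, bool]:
    """Consume outcomes one by one and stop once the observed count is
    significantly below what `target_rate` would produce.

    Returns (observed_rate, trials_used, stopped_early).
    """
    successes = 0
    trials = 0
    for outcome in outcomes:
        trials += 1
        successes += int(outcome)
        # One-sided test of H0: true rate >= target_rate.
        if trials >= min_trials and binom_cdf(successes, trials, target_rate) < alpha:
            return successes / trials, trials, True
    rate = successes / trials if trials else 0.0
    return rate, trials, False


if __name__ == "__main__":
    print(evaluate_with_early_stop([False] * 20))
    print(evaluate_with_early_stop([True] * 10))
```

Testing after every trial inflates the false-rejection rate compared with a single test at level `alpha`; a careful implementation would correct for this, for example with a sequential testing scheme.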

Paper Structure (26 sections, 5 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Jailbreak Attack
Trustworthiness in Multimodal Large Language Models (MLLMs)
Methodology
Problem Formulation
AutoJailbreak
Weak-to-strong Prompt Optimization
Suffix-based Attack Enhancement
Efficient Search with Hypothesis Testing
Experiment
Experiment Setting
Model and Dataset
Baselines
Evaluation Metrics
...and 11 more sections

Figures (7)

Figure 1: An example of AutoJailbreak.
Figure 2: The framework of AutoJailbreak. Our method has three stages: prompt pool construction, prompt evaluation, and weak-to-strong contrastive prompting. In the first stage, we prompt LLMs to randomly generate a pool of jailbreak prompts. In the prompt evaluation stage, we use GPT-4V to score each prompt by its recognition success rate (RSR). In the third stage, we split the prompts into a weak pool and a strong pool based on a threshold value. We then prompt LLMs again with prompts sampled from both pools to perform a novel weak-to-strong prompting, yielding a stronger jailbreak prompt for the malicious facial identity inference attack.
Figure 3: Recognition success rate (RSR) across different prompt templates when using different red-team LLMs (ChatGPT/GPT-4). The distributions are visualized with Gaussian kernel density estimation. Table \ref{table:statistics-table} reports the underlying data for this figure.
Figure 4: Sample UMAP dimensionality reduction (neighbors = 15, minimum distance = 0.1)
Figure 5: Sample UMAP dimensionality reduction (neighbors = 15, minimum distance = 0.1). A prompt introducing a solar eclipse (green) and the jailbreak prompt generated from the GPT-4 traditional template (orange).
...and 2 more figures
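The pool-splitting step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `ScoredPrompt` type, the function names, the threshold of 0.5, and the choice to place scores equal to the threshold in the strong pool are all hypothetical, and prompts are represented only by opaque identifiers.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ScoredPrompt:
    prompt_id: str  # opaque identifier; no prompt text is needed here
    rsr: float      # recognition success rate in [0, 1]


def split_pools(
    scored: Sequence[ScoredPrompt], threshold: float
) -> Tuple[List[ScoredPrompt], List[ScoredPrompt]]:
    """Split scored prompts into (weak, strong) pools.

    Assumption: a score equal to the threshold counts as strong.
    """
    weak = [p for p in scored if p.rsr < threshold]
    strong = [p for p in scored if p.rsr >= threshold]
    return weak, strong


def sample_contrastive(
    weak: Sequence[ScoredPrompt],
    strong: Sequence[ScoredPrompt],
    k: int,
    rng: random.Random,
) -> Tuple[List[ScoredPrompt], List[ScoredPrompt]]:
    """Draw up to k items from each pool to serve as contrastive examples."""
    return (
        rng.sample(list(weak), min(k, len(weak))),
        rng.sample(list(strong), min(k, len(strong))),
    )


if __name__ == "__main__":
    pool = [
        ScoredPrompt("p1", 0.1),
        ScoredPrompt("p2", 0.4),
        ScoredPrompt("p3", 0.6),
        ScoredPrompt("p4", 0.9),
    ]
    weak_pool, strong_pool = split_pools(pool, threshold=0.5)
    weak_ex, strong_ex = sample_contrastive(weak_pool, strong_pool, 1, random.Random(0))
    print(len(weak_pool), len(strong_pool), len(weak_ex), len(strong_ex))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs, which makes it easier to compare candidate prompts between optimization rounds.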
