$\textit{MMJ-Bench}$: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

Fenghua Weng; Yue Xu; Chengyan Fu; Wenjie Wang

TL;DR

A unified, systematic evaluation framework and the first publicly available benchmark for MLLM jailbreak research. It assesses the effectiveness of various attack methods against SoTA MLLMs and evaluates how defense mechanisms affect both defense effectiveness and model utility on normal tasks.

Abstract

As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Multimodal Large Language Models (MLLMs), have shown exceptional performance in many real-world tasks. However, MLLMs face significant security challenges, such as jailbreak attacks, in which attackers attempt to bypass the model's safety alignment to elicit harmful responses. The threat of jailbreak attacks on MLLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that MLLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluation: each method is evaluated on different datasets and metrics, making it impossible to compare their effectiveness. To address this gap, we introduce $\textit{MMJ-Bench}$, a unified pipeline for evaluating jailbreak attacks and defense techniques for MLLMs. Through extensive experiments, we assess the effectiveness of various attack methods against SoTA MLLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contributes to the field by offering a unified and systematic evaluation framework and the first publicly available benchmark for MLLM jailbreak research. We also present several insightful findings that highlight directions for future studies.

Paper Structure (17 sections, 3 equations, 3 figures, 6 tables)

Introduction
Background and Related Work
Jailbreak Attack Threat Model
Jailbreak Attacks in MLLM
Jailbreak Defenses in MLLM
Jailbreak Benchmark for MLLMs
Study Design
Data Collection
Jailbreak Cases Generation
Response Generation
Evaluation
Experiment
Attack Implementation Details
Findings of Jailbreak Attacks
Defense Implementation Details
...and 2 more sections

Figures (3)

Figure 1: Workflow of MMJ-Bench
Figure 2: Attack success rate (ASR) of different attack techniques against MLLMs. ASR-Average is the mean ASR over ADV-16, ADV-64 and ADV-inf.
Figure 3: The trade-off between defense effectiveness, measured by the average ASR reduction across all attacks, and model utility on normal tasks, evaluated using the MM-Vet score. The circle markers represent the baseline performance of the vanilla models without any defense, while different markers signify the performance of various defense methods. The lines connecting the vanilla and post-defense performance of each model indicate the change introduced by the defenses. Each color corresponds to a specific target MLLM. Ideally, we aim for a high MM-Vet score (high model utility) and a low ASR (strong defense capacity).
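
The two aggregate quantities named in the captions above, ASR-Average and the average ASR reduction across attacks, can be illustrated with a minimal Python sketch. The function names, the dictionary layout, and the definition of "reduction" as vanilla ASR minus post-defense ASR are assumptions made for illustration; they are not taken from the MMJ-Bench code, and the numbers are placeholders, not results from the paper.

```python
from statistics import mean


def attack_success_rate(judgements):
    """Fraction of jailbreak attempts judged successful (True = harmful response)."""
    if not judgements:
        raise ValueError("need at least one judged response")
    return sum(judgements) / len(judgements)


def asr_average(asr_by_budget):
    """Mean ASR over several settings, e.g. ADV-16, ADV-64 and ADV-inf."""
    return mean(asr_by_budget.values())


def mean_asr_reduction(vanilla_asr, defended_asr):
    """Average drop in ASR across attacks.

    Assumption: reduction = vanilla ASR - post-defense ASR, per attack.
    """
    return mean(vanilla_asr[name] - defended_asr[name] for name in vanilla_asr)


# Placeholder values for illustration only; not results from the paper.
adv_asr = {"ADV-16": 0.5, "ADV-64": 0.25, "ADV-inf": 0.75}
vanilla = {"attack_a": 0.5, "attack_b": 0.75}
defended = {"attack_a": 0.25, "attack_b": 0.25}

print("ASR-Average:", asr_average(adv_asr))
print("Mean ASR reduction:", mean_asr_reduction(vanilla, defended))
```

Under these assumptions, a defense that sits well in the Figure 3 trade-off would show a large mean ASR reduction while leaving the utility score (MM-Vet in the paper) close to the vanilla model's.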
