How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei

TL;DR

The paper investigates jailbreak defenses for Large Vision-Language Models (LVLMs) by reframing generation as a binary classification task to study model refusals. It identifies two core mechanisms: safety shift (a global increase in refusals) and harmfulness discrimination (better separation of harmful from benign queries). It then proposes inter- and intra-mechanism ensembles to balance safety and helpfulness. In experiments on MM-SafetyBench and MOSSBench with LLaVA-1.5 across multiple defense methods, the authors show that ensembles can substantially improve safety or optimize the safety-helpfulness trade-off, with SR+MO and QR|SR emerging as particularly effective variants. The work also examines how fine-tuning and multimodal factors influence safety, provides a practical framework for selecting a defense strategy, and discusses limitations and future directions for multimodal jailbreak defenses.

Abstract

Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies, inter-mechanism ensembles and intra-mechanism ensembles, to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.
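
To make the two mechanisms concrete, the following is a minimal sketch, assuming each query has already been reduced to a single refusal probability before and after a defense is applied. The function name `summarize_defense` and both summary statistics are illustrative assumptions, not the paper's definitions: one measures how far harmful and benign queries move together toward refusal (safety shift), the other how much the gap between them widens (harmfulness discrimination).

```python
from statistics import mean


def summarize_defense(base_harmful, base_benign, def_harmful, def_benign):
    """Summarize how a defense moves refusal probabilities.

    Each argument is a list of refusal probabilities in [0, 1]:
    base_* without the defense, def_* with it. The two statistics
    are hypothetical illustrations, not the paper's metrics.
    """
    # Change in mean refusal probability for each query type.
    delta_harmful = mean(def_harmful) - mean(base_harmful)
    delta_benign = mean(def_benign) - mean(base_benign)

    return {
        # Average movement toward refusal across both query types.
        "safety_shift": (delta_harmful + delta_benign) / 2,
        # Widening of the harmful-vs-benign gap in refusal probability.
        "discrimination_gain": delta_harmful - delta_benign,
    }


# A defense that raises refusals on every query by the same amount.
shift_only = summarize_defense(
    base_harmful=[0.4, 0.6], base_benign=[0.1, 0.3],
    def_harmful=[0.7, 0.9], def_benign=[0.4, 0.6],
)

# A defense that raises refusals on harmful queries and lowers them on benign ones.
discriminating = summarize_defense(
    base_harmful=[0.4, 0.6], base_benign=[0.1, 0.3],
    def_harmful=[0.7, 0.9], def_benign=[0.0, 0.2],
)

print(shift_only)
print(discriminating)
```

Under these assumptions, the first defense shows a positive safety shift with essentially no discrimination gain, which is the pattern that costs helpfulness on benign queries. The second shows a smaller shift but a clear discrimination gain.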

Paper Structure

This paper contains 43 sections, 5 equations, 21 figures, and 13 tables.

Figures (21)

  • Figure 1: Illustration of the safety shift mechanism (shifting towards the same refusal side of the decision boundary) and the harmfulness discrimination mechanism (shifting towards opposite sides of the decision boundary).
  • Figure 2: Baseline
  • Figure 3: Individual Defenses
  • Figure 5: Baseline
  • Figure 6: Inter-Mechanism Ensembles
  • ...and 16 more figures