Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang

TL;DR

This work tackles the detection of unknown jailbreak attacks on large vision-language models (LVLMs) by shifting from attack-specific learning to task-specific learning. It introduces the Learning to Detect (LoD) framework, which uses Multi-modal Safety Concept Activation Vectors (MSCAVs) as safety-aware representations and a Safety Pattern Auto-Encoder (SPAE) that captures inter-layer safety patterns and flags attacks through anomaly detection. Across three LVLMs and multiple attack types, LoD achieves higher detection AUROC and greater robustness than baselines while remaining efficient and requiring no attack-specific training data. The approach offers a practical, generalizable defense against unseen jailbreaks, suited to multimodal safety pipelines.

Abstract

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison of attack detection methods and illustration of types of data used.
  • Figure 2: Overview of our Learning to Detect (LoD) framework, which consists of (a) a representation learning module and (b) an attack classification module. In our framework, attacked inputs are only seen during testing. During training, the representation learning module is trained on safe and unsafe inputs that are not attacked to extract safety-related representations (MSCAVs). The Safety Pattern Auto-Encoder is then trained exclusively on MSCAVs of safe inputs to model their typical distribution. During testing, the Safety Pattern Auto-Encoder reconstructs input MSCAVs. Attacked inputs, whose patterns differ from the learned typical distribution of safe MSCAVs, produce high reconstruction errors, while safe inputs yield low reconstruction errors.
  • Figure 3: The average MSCAV of safe inputs, unsafe inputs, and inputs attacked by unknown jailbreak methods (marked in brackets). MSCAVs possess the discriminative power for distinguishing attacks from safe inputs.
  • Figure 4: Test accuracy of MSCAV classifiers across layers in LVLMs. High accuracy indicates that safe and unsafe inputs are linearly separable at the corresponding layer.
  • Figure 5: Ablation study results showing the average AUROC scores on three LVLMs across five attack methods under different ablation settings. Both MSCAV and SPAE contribute to the detection accuracy.
  • ...and 1 more figure
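
The two-stage pipeline described in the Figure 2 caption can be illustrated with a minimal sketch on synthetic data. This is not the authors' implementation, and several pieces are assumptions made only for illustration: the hidden states are random Gaussians rather than LVLM activations; the sizes `N_LAYERS`, `DIM`, and `SHIFT` are hypothetical; the per-layer classifier is a closed-form difference-of-means linear probe; the Safety Pattern Auto-Encoder is replaced by a linear auto-encoder fitted via PCA; and the "attack" is a made-up pattern that looks safe in early layers and unsafe in later ones. The sketch only shows the flow: probes trained on non-attacked safe and unsafe inputs produce a per-layer probability vector (an MSCAV-like representation), the auto-encoder is fitted on safe vectors only, and reconstruction error serves as the detection score.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, DIM, SHIFT = 6, 16, 6.0  # hypothetical sizes, not taken from the paper

# One hypothetical "unsafe direction" per layer (unit vectors).
directions = rng.normal(size=(N_LAYERS, DIM))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)


def synth_hidden(n, unsafe_layers):
    """Synthetic hidden states of shape (n, N_LAYERS, DIM).

    Layers flagged True in `unsafe_layers` are shifted along that layer's
    unsafe direction; the remaining layers are plain Gaussian noise.
    """
    h = rng.normal(size=(n, N_LAYERS, DIM))
    mask = np.asarray(unsafe_layers, dtype=float)[None, :, None]
    return h + SHIFT * mask * directions[None, :, :]


def fit_probes(h_safe, h_unsafe):
    """Closed-form linear probe per layer (difference of class means)."""
    mu_s, mu_u = h_safe.mean(axis=0), h_unsafe.mean(axis=0)  # (N_LAYERS, DIM)
    w = mu_u - mu_s
    b = -np.sum(w * (mu_u + mu_s), axis=1) / 2.0
    return w, b


def mscav(h, w, b):
    """Per-layer probability of 'unsafe', shape (n, N_LAYERS)."""
    logits = np.einsum("nld,ld->nl", h, w) + b
    return 1.0 / (1.0 + np.exp(-logits))


def fit_linear_ae(x, k):
    """Linear auto-encoder via PCA: keep the top-k principal directions."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]


def recon_error(x, mean, comps):
    """Mean squared reconstruction error per sample."""
    z = (x - mean) @ comps.T   # encode into k dimensions
    x_hat = mean + z @ comps   # decode back to N_LAYERS dimensions
    return np.mean((x - x_hat) ** 2, axis=1)


def auroc(pos, neg):
    """Probability that a positive outscores a negative (ties count half)."""
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


all_safe, all_unsafe = [False] * N_LAYERS, [True] * N_LAYERS
# Hypothetical attack pattern: safe-looking early layers, unsafe-looking late layers.
late_only = [i >= N_LAYERS // 2 for i in range(N_LAYERS)]

# Stage (a): probes are fitted on non-attacked safe and unsafe inputs only.
w, b = fit_probes(synth_hidden(200, all_safe), synth_hidden(200, all_unsafe))
# Stage (b): the auto-encoder is fitted on safe MSCAV-like vectors only.
ae_mean, ae_comps = fit_linear_ae(mscav(synth_hidden(500, all_safe), w, b), k=2)

# Test time: attacked inputs appear here for the first time.
err_safe = recon_error(mscav(synth_hidden(200, all_safe), w, b), ae_mean, ae_comps)
err_attack = recon_error(mscav(synth_hidden(200, late_only), w, b), ae_mean, ae_comps)
detection_auroc = auroc(err_attack, err_safe)
print(f"AUROC on synthetic data: {detection_auroc:.3f}")
```

Because the auto-encoder never sees attacked inputs, nothing in the fitting step depends on a particular attack. Only the reconstruction-error threshold would need to be chosen at deployment time, and the sketch leaves that open.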