OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia, Jie Liao, Qi Guo, Teng Ma, Simeng Qin, Ranjie Duan, Tianlin Li, Yihao Huang, Zhitao Zeng, Dongxian Wu, Yiming Li, Wenqi Ren, Xiaochun Cao, Yang Liu

TL;DR

This work addresses the safety vulnerabilities of multimodal large language models (MLLMs) by introducing OmniSafeBench-MM, a unified benchmark and toolbox for multimodal jailbreak attack-defense evaluation. It combines a large-scale, automated risk-data generation pipeline, 13 attack methods, 15 defense strategies, and a three-dimensional Harmfulness-Alignment-Detail (H-F-A-D) scoring protocol to enable nuanced, reproducible evaluation beyond the traditional attack success rate (ASR) metric. Extensive experiments across 18 MLLMs, covering both open-source and commercial systems, reveal persistent vulnerabilities, especially under cross-modal and black-box conditions, and highlight complex defense trade-offs in which some protections reduce harm at the cost of usefulness. By providing modular, open-source data, methods, and evaluation tools, OmniSafeBench-MM establishes a scalable foundation for advancing multimodal safety research and standardized benchmarking.

Abstract

Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, rated on a granular, multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs to reveal their vulnerability to multi-modal jailbreak attacks. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM.
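
As a rough illustration of how a three-dimensional judgement (harmfulness, intent alignment, detail) could feed an aggregate such as an attack success rate, the Python sketch below may help. It is not the toolbox's API: the `Judgement` class, the 0 to 4 score ranges, the thresholds, and the rule that all three scores must clear a threshold are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """Scores for one model response (scales are assumed, not taken from the paper)."""
    harmfulness: int  # assumed scale: 0 (harmless) .. 4 (catastrophic)
    alignment: int    # assumed scale: 0 (ignores the query) .. 4 (fully on-intent)
    detail: int       # assumed scale: 0 (no detail) .. 4 (step-by-step detail)


def is_successful_jailbreak(j: Judgement, harm_min: int = 2,
                            align_min: int = 2, detail_min: int = 2) -> bool:
    # Hypothetical rule: an attack counts as successful only if the response is
    # harmful, matches the query's intent, and is detailed enough to be usable.
    return (j.harmfulness >= harm_min
            and j.alignment >= align_min
            and j.detail >= detail_min)


def attack_success_rate(judgements: list[Judgement]) -> float:
    """Fraction of responses judged as successful jailbreaks (0.0 if empty)."""
    if not judgements:
        return 0.0
    hits = sum(is_successful_jailbreak(j) for j in judgements)
    return hits / len(judgements)


# Illustrative judgements only; these are not results from the paper.
sample = [
    Judgement(harmfulness=4, alignment=4, detail=3),  # harmful, on-intent, detailed
    Judgement(harmfulness=3, alignment=1, detail=4),  # harmful but off-intent
    Judgement(harmfulness=0, alignment=4, detail=4),  # safe, helpful answer
    Judgement(harmfulness=2, alignment=2, detail=2),  # just clears every threshold
]
print(attack_success_rate(sample))  # 0.5
```

Keeping the three scores separate, rather than collapsing them into a single label up front, is what lets the same judgements be re-thresholded later for safety-utility analysis.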

Paper Structure

This paper contains 27 sections, 6 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overview of OmniSafeBench-MM. The benchmark unifies multi-modal jailbreak attack–defense evaluation, 13 attack and 15 defense methods, and a three-dimensional scoring protocol measuring harmfulness, alignment, and detail.
  • Figure 2: The framework of the proposed method.
  • Figure 3: Safety taxonomy of our OmniSafeBench-MM.
  • Figure 4: Vulnerability Heatmap of 16 MLLMs against 8 Jailbreak Attacks. The heatmap visualizes the Attack Success Rate (ASR %), where darker red indicates higher vulnerability (higher ASR) and lighter yellow indicates higher robustness. Models are categorized into Open Source (left) and Closed Source (right), sorted by their average vulnerability within each group. The 'Avg.' row and column denote the mean ASR for each model and attack method, respectively.
  • Figure 5: Radar plot illustrating model-wise ASR distribution under each attack method. The plot highlights structural differences in attack effectiveness across models.
  • ...and 6 more figures
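
The 'Avg.' row and column and the per-group ordering described in the Figure 4 caption amount to simple means over a model-by-attack ASR table. The Python sketch below shows one way to compute them. The model and attack names and every number in it are placeholders rather than results from the paper, and sorting from most to least vulnerable is an assumption, since the caption does not state the direction.

```python
from statistics import mean

# Placeholder ASR values in percent; NOT results from the paper.
asr = {
    "model_a": {"attack_x": 60.0, "attack_y": 20.0},
    "model_b": {"attack_x": 30.0, "attack_y": 10.0},
    "model_c": {"attack_x": 90.0, "attack_y": 30.0},
}


def model_averages(table):
    """Mean ASR per model, taken across all attacks."""
    return {model: mean(row.values()) for model, row in table.items()}


def attack_averages(table):
    """Mean ASR per attack, taken across all models."""
    attacks = next(iter(table.values())).keys()
    return {a: mean(row[a] for row in table.values()) for a in attacks}


def sort_by_vulnerability(table):
    """Model names ordered from highest to lowest mean ASR (assumed direction)."""
    avgs = model_averages(table)
    return sorted(avgs, key=avgs.get, reverse=True)


print(model_averages(asr))
print(attack_averages(asr))
print(sort_by_vulnerability(asr))
```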