SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models

Junfeng Fang, Yukai Wang, Ruipeng Wang, Zijun Yao, Kun Wang, An Zhang, Xiang Wang, Tat-Seng Chua

TL;DR

This work analyzes safety in multi-modal large reasoning models (MLRMs) by comparing them to their base MLLMs across standardized jailbreaking tests and safety benchmarks. It introduces OpenSafeMLRM, a toolkit that unifies safety evaluation across MLRMs, datasets, and attacks. The study reports three main findings: a reasoning tax that degrades safety alignment, scenario-specific safety blind spots where attack rates rise far above the average, and nascent self-correction in which a safe answer overrides unsafe reasoning. The authors argue for targeted, scenario-aware auditing and defense hardening to ensure that reasoning-enabled AI aligns with ethical safeguards.

Abstract

The rapid advancement of multi-modal large reasoning models (MLRMs) -- enhanced versions of multimodal language models (MLLMs) equipped with reasoning capabilities -- has revolutionized diverse applications. However, their safety implications remain underexplored. While prior work has exposed critical vulnerabilities in unimodal reasoning models, MLRMs introduce distinct risks from cross-modal reasoning pathways. This work presents the first systematic safety analysis of MLRMs through large-scale empirical studies comparing MLRMs with their base MLLMs. Our experiments reveal three critical findings: (1) The Reasoning Tax: Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. MLRMs exhibit 37.44% higher jailbreaking success rates than base MLLMs under adversarial attacks. (2) Safety Blind Spots: While safety degradation is pervasive, certain scenarios (e.g., Illegal Activity) suffer 25 times higher attack rates -- far exceeding the average 3.4 times increase, revealing scenario-specific vulnerabilities with alarming consistency across models and datasets. (3) Emergent Self-Correction: Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction -- 16.9% of jailbroken reasoning steps are overridden by safe answers, hinting at intrinsic safeguards. These findings underscore the urgency of scenario-aware safety auditing and mechanisms to amplify MLRMs' self-correction potential. To catalyze research, we open-source OpenSafeMLRM, the first toolkit for MLRM safety evaluation, providing a unified interface for mainstream models, datasets, and jailbreaking methods. Our work calls for immediate efforts to harden reasoning-augmented AI, ensuring its transformative potential aligns with ethical safeguards.
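The quantities quoted in the abstract can be illustrated with a minimal Python sketch. It assumes that an attack success rate is the fraction of queries whose final answer a judge flags as unsafe, and that the self-correction rate is the fraction of responses with unsafe reasoning whose final answer is nonetheless safe. The `Judgement` record, the function names, and the toy data are hypothetical and are not taken from OpenSafeMLRM; the paper's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """Hypothetical per-query verdict from a safety judge."""
    reasoning_unsafe: bool  # reasoning trace flagged as harmful
    answer_unsafe: bool     # final answer flagged as harmful


def attack_success_rate(records):
    """Fraction of responses whose final answer is judged unsafe (assumed definition)."""
    if not records:
        raise ValueError("need at least one record")
    return sum(r.answer_unsafe for r in records) / len(records)


def self_correction_rate(records):
    """Among responses with unsafe reasoning, the fraction whose answer is safe."""
    jailbroken = [r for r in records if r.reasoning_unsafe]
    if not jailbroken:
        return 0.0
    return sum(not r.answer_unsafe for r in jailbroken) / len(jailbroken)


# Toy data, invented for illustration only (not results from the paper).
base = [Judgement(False, False)] * 9 + [Judgement(True, True)]
mlrm = (
    [Judgement(True, True)] * 5       # unsafe reasoning, unsafe answer
    + [Judgement(True, False)]        # unsafe reasoning, safe answer
    + [Judgement(False, False)] * 4   # safe throughout
)

base_asr = attack_success_rate(base)
mlrm_asr = attack_success_rate(mlrm)
absolute_gap = mlrm_asr - base_asr    # difference in rates (percentage points)
ratio = mlrm_asr / base_asr           # "N times higher" style comparison
self_corr = self_correction_rate(mlrm)
```

The sketch keeps the absolute gap and the ratio separate because the abstract uses both styles of comparison (a percentage increase and an "N times" increase), and the two can diverge sharply when the base rate is small.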

Paper Structure

This paper contains 11 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Comparison of ASR and HR scores across various MLRMs and their base MLLMs under vanilla unsafe text queries. The dark blue bars represent the ASR and HR of the base MLLMs. Best viewed in color.
  • Figure 2: Comparison of ASR and HR scores across MLRMs and their base MLLMs under jailbreak attacks. The dark blue bars represent the ASR and HR of the base MLLMs. Best viewed in color.
  • Figure 3: Comparison of ASR scores across different MLRMs and their base MLLMs. For abbreviations, va. and ja. refer to performance under vanilla unsafe text queries and jailbreak attacks, respectively. We use MB to denote MLRMs developed with the MBerry method.
  • Figure 4: The relationship between reasoning safety and answer safety, where the horizontal and vertical axes represent HR scores. The numbers in the color blocks represent the normalized probabilities, with deeper colors indicating higher probabilities. Best viewed in color.