SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models

Junfeng Fang; Yukai Wang; Ruipeng Wang; Zijun Yao; Kun Wang; An Zhang; Xiang Wang; Tat-Seng Chua

TL;DR

This work analyzes the safety of multi-modal large reasoning models (MLRMs) by comparing them with their base MLLMs on standardized jailbreaking tests and safety benchmarks. It introduces OpenSafeMLRM, a toolkit that provides a unified interface for evaluating MLRM safety across mainstream models, datasets, and jailbreaking methods. The study reports three main findings: a reasoning tax, in which acquiring reasoning capabilities degrades inherited safety alignment; scenario-specific safety blind spots, where attack rates rise far more than the average; and nascent self-correction, in which a safe final answer overrides unsafe reasoning. The authors argue for scenario-aware safety auditing and hardened defenses so that reasoning-augmented AI stays aligned with ethical safeguards.

Abstract

The rapid advancement of multi-modal large reasoning models (MLRMs) -- enhanced versions of multi-modal large language models (MLLMs) equipped with reasoning capabilities -- has revolutionized diverse applications. However, their safety implications remain underexplored. While prior work has exposed critical vulnerabilities in unimodal reasoning models, MLRMs introduce distinct risks from cross-modal reasoning pathways. This work presents the first systematic safety analysis of MLRMs through large-scale empirical studies comparing MLRMs with their base MLLMs. Our experiments reveal three critical findings: (1) The Reasoning Tax: Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. MLRMs exhibit 37.44% higher jailbreaking success rates than base MLLMs under adversarial attacks. (2) Safety Blind Spots: While safety degradation is pervasive, certain scenarios (e.g., Illegal Activity) suffer 25 times higher attack rates -- far exceeding the average 3.4 times increase -- revealing scenario-specific vulnerabilities that are alarmingly consistent across models and datasets. (3) Emergent Self-Correction: Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction -- 16.9% of jailbroken reasoning steps are overridden by safe answers, hinting at intrinsic safeguards. These findings underscore the urgency of scenario-aware safety auditing and of mechanisms to amplify MLRMs' self-correction potential. To catalyze research, we open-source OpenSafeMLRM, the first toolkit for MLRM safety evaluation, providing a unified interface for mainstream models, datasets, and jailbreaking methods. Our work calls for immediate efforts to harden reasoning-augmented AI, ensuring its transformative potential aligns with ethical safeguards.
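
The three findings above are stated as rates and ratios, so a small worked sketch may help make the arithmetic concrete. It assumes each model response has already been labeled by some external safety judge with two flags: whether the reasoning trace is unsafe and whether the final answer is unsafe. The names `Judgement`, `attack_success_rate`, `self_correction_rate`, and `asr_ratio` are hypothetical, the toy records are made up for illustration, and the definitions are one plausible reading of the abstract rather than the paper's exact metrics or the OpenSafeMLRM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """One evaluated response. Both flags are assumed to come from an external safety judge."""
    reasoning_unsafe: bool  # the reasoning trace was judged harmful
    answer_unsafe: bool     # the final answer was judged harmful


def attack_success_rate(records):
    """Fraction of responses whose final answer was judged unsafe (assumed definition)."""
    if not records:
        return 0.0
    return sum(r.answer_unsafe for r in records) / len(records)


def self_correction_rate(records):
    """Among responses with unsafe reasoning, the fraction whose final answer is safe."""
    jailbroken = [r for r in records if r.reasoning_unsafe]
    if not jailbroken:
        return 0.0
    return sum(not r.answer_unsafe for r in jailbroken) / len(jailbroken)


def asr_ratio(mlrm_asr, base_asr):
    """How many times higher the MLRM attack success rate is than the base model's."""
    if base_asr == 0:
        return float("inf")
    return mlrm_asr / base_asr


# Toy, made-up labels: 10 responses per model on the same attack prompts.
base_records = [Judgement(False, True)] + [Judgement(False, False)] * 9
mlrm_records = (
    [Judgement(True, True)] * 4      # unsafe reasoning and unsafe answer
    + [Judgement(True, False)]       # unsafe reasoning overridden by a safe answer
    + [Judgement(False, False)] * 5  # safe throughout
)

base_asr = attack_success_rate(base_records)
mlrm_asr = attack_success_rate(mlrm_records)

print("base ASR:", base_asr)
print("MLRM ASR:", mlrm_asr)
print("ASR ratio:", asr_ratio(mlrm_asr, base_asr))
print("self-correction rate:", self_correction_rate(mlrm_records))
```

Computing the ratio per scenario rather than over the pooled data is what would surface a blind spot such as the Illegal Activity case, since a pooled average can hide a single scenario with a much larger increase.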
