SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

Huahui Yi, Kun Wang, Qiankun Li, Miao Yu, Liang Lin, Gongli Xi, Hao Wu, Xuming Hu, Kang Li, Yang Liu

TL;DR

SaFeR-VLM addresses the gap where multimodal reasoning models expose safety risks by embedding safety directly into the reasoning process. It introduces a four-component framework—QI-Safe-10K data curation, safety-aware rollout with reflection, structured reward modeling, and GRPO-based optimization—that actively guides reasoning toward safe trajectories. Empirical results on six benchmarks show state-of-the-art safety at 3B and 7B scales while maintaining strong helpfulness, including distributional robustness and effective handling of implicit risks. The work demonstrates a scalable, interpretable approach to safety-aware multimodal reasoning with practical implications for deploying trustworthy AI in cross-modal tasks.

Abstract

Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the Reasoning Tax. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average scores of 70.13 on safety and 78.97 on helpfulness across six benchmarks, surpassing both same-scale models and models more than 10× larger, such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points, respectively, on safety metrics, without any degradation in helpfulness. Our code is available at https://github.com/HarveyYi/SaFeR-VLM.
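
To make components (III) and (IV) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each rollout is scored on a few sub-criteria in [0, 1], that flagged hallucinations and contradictions subtract a fixed penalty, and that GRPO standardizes rewards within a rollout group. The criterion names, weights, penalty sizes, and function names are all hypothetical; the abstract does not specify them.

```python
from statistics import mean, pstdev

# Hypothetical weights and penalties; the abstract gives no concrete values.
WEIGHTS = {"safety": 0.5, "helpfulness": 0.3, "reasoning": 0.2}
PENALTIES = {"hallucination": 0.3, "contradiction": 0.3}


def structured_reward(scores, flags):
    """Weighted sum of sub-criterion scores minus penalties, floored at 0.

    scores: dict mapping each criterion in WEIGHTS to a value in [0, 1].
    flags:  iterable of detected issues, each a key of PENALTIES.
    """
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    penalty = sum(PENALTIES[flag] for flag in flags)
    return max(0.0, base - penalty)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example group: one clean rollout, one corrected rollout, one flagged rollout.
group = [
    structured_reward({"safety": 1.0, "helpfulness": 0.8, "reasoning": 0.9}, []),
    structured_reward({"safety": 0.9, "helpfulness": 0.6, "reasoning": 0.7}, []),
    structured_reward({"safety": 0.4, "helpfulness": 0.7, "reasoning": 0.5},
                      ["hallucination"]),
]
advantages = group_relative_advantages(group)
```

In this sketch the flagged rollout receives the lowest reward and therefore a negative advantage, while the safe and corrected rollouts are pushed up relative to the group mean, which mirrors the idea of reinforcing both safe and corrected trajectories.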

Paper Structure

This paper contains 38 sections, 18 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 2: Overview of SaFeR-VLM, a safety-aligned RL framework. QI-Safe-10K is curated with QI-Box filtering for balanced quality and instability. Safety-Aware Rollout corrects unsafe outputs before scoring. Reward Modeling aggregates weighted sub-criteria with penalties, and Safety-Aware Optimization integrates safe and corrected trajectories to reinforce consistent safe reasoning.
  • Figure 3: Safety score distributions on six benchmarks, comparing baseline models with our SaFeR-VLM-7B, which achieves consistently higher and more stable performance.
  • Figure 4: QI-Box curation with Qwen2.5VL-7B. Left. Selected Quality–Instability region. Right. Ablations on three datasets, with the chosen region giving the best Safety and Helpfulness.
  • Figure 5: Impact of prompt design and reward model choice on generative reward modeling with Qwen2.5VL-3B. Left. Weighted Criteria prompts outperform simpler rules. Right. Larger and safety-aligned models further boost Safety and Helpfulness, with GRM-RL-7B best.
  • Figure 6: Case study from SIUO. Unlike GPT-5-mini and Gemini-2.5-Flash, SaFeR-VLM-7B actively identifies hidden risks, refuses unsafe requests, and redirects the user toward safe alternatives, exemplifying safety-aware reasoning in practice.
  • ...and 1 more figure
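
The QI-Box curation mentioned in the captions for Figures 2 and 4 can be illustrated with a small Python sketch. The captions do not say how quality and instability are computed, so this example assumes quality is the mean score of a sample across several rollouts and instability is the standard deviation of those scores; the box bounds and all names are hypothetical.

```python
from statistics import mean, pstdev


def qi_stats(rollout_scores):
    """Return (quality, instability) for one sample.

    Assumption: quality = mean rollout score, instability = population std.
    """
    return mean(rollout_scores), pstdev(rollout_scores)


def qi_box_filter(samples, q_range, i_range):
    """Keep sample ids whose (quality, instability) fall inside the box.

    samples: dict mapping sample id to a list of per-rollout scores.
    q_range, i_range: inclusive (low, high) bounds, chosen hypothetically.
    """
    kept = []
    for sample_id, scores in samples.items():
        quality, instability = qi_stats(scores)
        if q_range[0] <= quality <= q_range[1] and i_range[0] <= instability <= i_range[1]:
            kept.append(sample_id)
    return kept


samples = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],   # high quality, no instability
    "borderline":    [0.2, 0.8, 0.4, 0.6],   # mid quality, unstable
    "never_solved":  [0.0, 0.0, 0.0, 0.0],   # low quality, no instability
}
selected = qi_box_filter(samples, q_range=(0.3, 0.8), i_range=(0.1, 0.5))
```

Under these assumptions only the borderline sample is kept: samples the model always or never handles well carry little training signal, whereas samples with moderate quality and noticeable instability are the ones a box over the two statistics would select.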