MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models

Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng

TL;DR

MSR-Align tackles safety risks in reasoning-capable vision-language models by introducing a policy-grounded, multimodal safety reasoning dataset. The dataset enables supervision of the full multimodal chain-of-thought by grounding reasoning traces in text and image with standardized safety policies. Empirical results show that fine-tuning VLMs on MSR-Align significantly improves safety against textual and vision-language jailbreak attacks while preserving or enhancing general multimodal reasoning. The work demonstrates that high-quality, policy-aligned multimodal reasoning data is a scalable path toward safer, more robust VLMs.

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.

Paper Structure

This paper contains 30 sections, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Distribution of safety-relevant categories in the MSR-Align dataset. The pie chart illustrates the proportion of data samples across 20 different risk areas, including dangerous behavior (three subcategories), psychological harm, environmental damage, and more, reflecting the dataset’s comprehensive coverage of multimodal safety concerns.
  • Figure 2: Overview of the pipeline. The system consists of four key stages: (1) Data Collection, which gathers image-instruction pairs across 20 risk categories; (2) Policy-Grounded Multimodal Reasoning, where GPT-4o generates chain-of-thought (CoT) rationales and safe responses based on policy constraints; (3) High-Quality Data Filtering, which evaluates the outputs based on safety, policy alignment, coherence, and fluency; and (4) Safety Alignment, where the filtered data is used to fine-tune an unsafe model into a safe one through supervised fine-tuning (SFT).
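
The filtering and formatting stages named in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Sample` fields, the 1-5 score scale, the `min_score` threshold, and the prompt/target layout are all assumptions made for illustration. Only the four judging criteria (safety, policy alignment, coherence, fluency) come from the caption.

```python
from dataclasses import dataclass
from typing import Dict, List

# The four judging criteria named in the Figure 2 caption.
CRITERIA = ("safety", "policy_alignment", "coherence", "fluency")


@dataclass
class Sample:
    """One candidate training example. All field names are hypothetical."""
    instruction: str
    image_path: str
    risk_category: str
    rationale: str            # chain-of-thought written under policy constraints
    response: str             # final safe response
    scores: Dict[str, float]  # judge score per criterion (assumed 1-5 scale)


def passes_filter(sample: Sample, min_score: float = 4.0) -> bool:
    """Keep a sample only if every criterion reaches the assumed threshold.

    A criterion with no score counts as a failure.
    """
    return all(sample.scores.get(c, 0.0) >= min_score for c in CRITERIA)


def filter_samples(samples: List[Sample], min_score: float = 4.0) -> List[Sample]:
    """Return the samples that pass on all four criteria, in input order."""
    return [s for s in samples if passes_filter(s, min_score)]


def to_sft_record(sample: Sample) -> Dict[str, str]:
    """Turn a kept sample into a prompt/target pair for supervised fine-tuning.

    Putting the rationale before the response is an assumed layout.
    """
    return {
        "image": sample.image_path,
        "prompt": sample.instruction,
        "target": f"{sample.rationale}\n\n{sample.response}",
    }
```

Requiring every criterion to pass, rather than averaging the scores, keeps a fluent but unsafe rationale from slipping through on the strength of its other scores. Whether the paper's judges apply a rule of this kind is not stated in the caption.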