Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

Fenghua Weng, Chaochao Lu, Xia Hu, Wenqi Shao, Wenjie Wang

TL;DR

Think-Reflect-Revise (TRR) addresses safety alignment in LVLMs by adding explicit policy-guided reflection and revision to the usual reasoning process. It introduces ReSafe, a 5k-sample dataset with think–reflect–revise structure, and uses SFT followed by GRPO reinforcement learning to instill reflective behavior. Across safety benchmarks and jailbreak attacks, TRR yields substantial gains in safe response rates while preserving or enhancing general multimodal reasoning, outperforming several baselines and matching frontier models on safety. The work demonstrates that explicit, policy-driven self-correction can robustly defend against visually concealed safety violations in multimodal models and suggests directions for broader safety alignment.

Abstract

As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.
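To put the headline numbers in perspective, the following is a minimal sketch of how a safe response rate and its improvement could be computed. The helpers `safe_rate` and `gain` are hypothetical names introduced here for illustration and are not taken from the paper; the only figures drawn from the source are the 42.8% and 87.7% overall safe response rates for Qwen2.5-VL-7B. The sketch assumes each response has already been judged safe or unsafe.

```python
def safe_rate(judgments):
    """Percentage of responses judged safe.

    `judgments` is a list of booleans (True = safe). How each response is
    judged is outside this sketch and is an assumption, not the paper's
    evaluation protocol.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    return 100.0 * sum(judgments) / len(judgments)


def gain(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Overall safe response rates reported in the abstract for Qwen2.5-VL-7B.
baseline, trr = 42.8, 87.7
abs_gain, rel_gain = gain(baseline, trr)
print(f"{abs_gain:.1f} points absolute, {rel_gain:.1f}% relative")
```

Under these assumptions, the reported change corresponds to a gain of about 44.9 percentage points, which is slightly more than a doubling of the baseline safe response rate.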

Paper Structure

This paper contains 23 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison of the responses produced by SafeWork-R1-7B and Qwen2.5-VL-7B trained with our Think-Reflect-Revise when subjected to the MML-Mirror jailbreak attack. MML-Mirror embeds harmful intent by first encoding malicious queries into images through typographic rendering, then mirroring the images to obscure the harmful content. The attack is further contextualized within a fictional game development scenario, prompting the model to complete the visual content in a manner aligned with the villain’s objectives.
  • Figure 2: Overview of Think-Reflect-Revise (TRR). TRR comprises three stages: (1) ReSafe Dataset Construction, in which we construct a dataset of think-reflect-revise examples; (2) Reflective Behavior Initialization, in which the target model is fine-tuned to initialize reflective reasoning; and (3) Reflective Behavior Enhancement, in which we further strengthen the reflective behavior through reinforcement learning.
  • Figure 3: Ablation study on safety training stages of TRR.
  • Figure 4: Improvement in safe rate after self-reflection across safety categories under the MML-M attack, for Qwen2.5-VL-7B trained with TRR.