Towards Policy-Adaptive Image Guardrail: Benchmark and Method

Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, Shuigeng Zhou

TL;DR

This paper introduces SafeGuard-VL, a reinforcement learning method with verifiable rewards (RLVR) for robust unsafe-image guardrails. It explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies.

Abstract

Accurately rejecting sensitive or harmful visual content, i.e., acting as a harmful-image guardrail, is critical in many application scenarios. This task must continuously adapt to safety policies and content that evolve across domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under a single fixed safety policy. We find that these models overfit heavily to the seen policy, fail to generalize to unseen policies, and even lose basic instruction-following ability and general knowledge. To address this issue, we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets in which each safe-unsafe image pair remains visually similar except for the localized regions that violate specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe-image guardrails across various policies.
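
To make the idea of a policy-grounded, verifiable reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes the model ends its response with a verdict wrapped in `<answer>` tags and that a human label is available for the image under the policy given in the prompt. The tag format, the function name `policy_reward`, and the binary 0/1 reward values are all illustrative assumptions; the abstract does not specify them.

```python
import re

# Assumed (hypothetical) output format: the response contains
# "<answer>safe</answer>" or "<answer>unsafe</answer>".
ANSWER_RE = re.compile(r"<answer>\s*(safe|unsafe)\s*</answer>", re.IGNORECASE)


def policy_reward(model_output: str, gold_label: str) -> float:
    """Return 1.0 if the parsed verdict matches the label assigned under the
    active policy, and 0.0 otherwise (including unparseable outputs)."""
    match = ANSWER_RE.search(model_output)
    if match is None:
        # No well-formed verdict: nothing can be verified, so no reward.
        return 0.0
    return 1.0 if match.group(1).lower() == gold_label.lower() else 0.0
```

Because the reward depends only on the verdict and the label for the policy in the prompt, the same image can yield different rewards under different policies, which is the behavior the abstract describes as policy-grounded.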

Paper Structure (17 sections, 7 figures, 6 tables)

This paper contains 17 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: High-level illustration of SafeGuard-VL. Unlike prior guardrails that fit only a single fixed safety policy, SafeGuard-VL is designed for cross-policy adaptability and robustness. In Stage 1 (SFT), the model learns general unsafe-related visual and textual semantics through our unsafe recaption and data construction pipeline. In Stage 2 (RL), the model is optimized to perform policy-aware safe/unsafe discrimination, adapting its decisions to different policy definitions rather than relying on a single fixed rule set. This two-stage framework enables SafeGuard-VL to generalize to unseen or shifting safety policies during testing.
  • Figure 2: Examples from the proposed SafeEditBench dataset. Our key innovation lies in constructing semantically aligned safe-unsafe image pairs where the global visual semantics remain unchanged, while only the minimal unsafe regions are locally edited using precise image-editing operations. This produces safe counterparts that preserve the original scene, composition, and objects, altering solely the safety-violating content. Such fine-grained, locality-preserving edits make SafeEditBench highly challenging: models must accurately identify and reason about the specific unsafe elements rather than relying on coarse, scene-level cues.
  • Figure 3: The proposed self-recaptioning mechanism, which lets the model generate and refine its own captions. Specifically, the baseline model (Qwen-VL) first produces a high-level description with fewer unsafe details, sampled from its own distribution. The recaptioning model (Gemma 27B) then minimally edits this caption to recover the suppressed unsafe semantics, producing a caption that preserves the original structure while adding explicit harmful descriptions. This paired supervision is then used to train the same model via both SFT and RL.
  • Figure 4: Statistics of the five policy levels in SafeEditBench, showing how the same image set is labeled differently under varying safety policies. From L1 (most permissive) to L5 (most restrictive), each policy defines different categories of violation. Policies L3 and L4 reflect widely accepted societal norms, while L1 and L5 represent the most counterintuitive regimes, designed to test policy adherence.
  • Figure 5: Examples showing that "safety" is fundamentally policy-dependent rather than dictated by common sense. The same image may be judged "Safe" or "Unsafe" under different policies, especially when the policies adopt counterintuitive definitions of safety (e.g., prohibiting ordinary affection while allowing sexually suggestive content). These examples highlight the core challenge: safety labels are not intrinsic to the image alone but are also determined by the specific policy applied.
  • ...and 2 more figures
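
The policy-dependent labeling described in the captions for Figures 4 and 5 suggests scoring a guardrail separately under each policy level. The sketch below is a hypothetical illustration, not the benchmark's actual evaluation code. The record layout `(image_id, policy_level, human_label, model_prediction)` and the function name `per_policy_accuracy` are assumptions, and the toy records are invented for the example, not taken from the paper.

```python
from collections import defaultdict


def per_policy_accuracy(records):
    """Compute accuracy per policy level from
    (image_id, policy_level, human_label, model_prediction) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for _image_id, policy, gold, pred in records:
        total[policy] += 1
        correct[policy] += int(gold == pred)
    return {policy: correct[policy] / total[policy] for policy in total}


# Invented toy records: the same image carries different labels under L1 and L5.
toy_records = [
    ("img1", "L1", "safe", "safe"),
    ("img1", "L5", "unsafe", "safe"),
    ("img2", "L5", "unsafe", "unsafe"),
]
print(per_policy_accuracy(toy_records))
```

A model that ignores the policy text and always returns the same verdict for `img1` cannot be correct under both L1 and L5 here, so a per-policy breakdown exposes overfitting to one policy that a single pooled accuracy could hide.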