SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations

Zhaorun Chen, Francesco Pinto, Minzhou Pan, Bo Li

TL;DR

SafeWatch addresses the need for scalable, policy-driven video guardrails by combining parallel policy encoding with policy-aware token pruning, which reduces latency and mitigates position bias while still delivering grounded explanations. It introduces Parallel Equivalent Policy Encoding (PEPE) to process safety policies in parallel and Policy-Aware Adaptive Pruning (PAP) to focus computation on policy-relevant video tokens, enabling zero-shot generalization to new policies. The SafeWatch-Bench dataset, with over 2M videos annotated through a multi-agent consensus pipeline, supports training and evaluation on both real-world and generative AI content. Empirically, SafeWatch outperforms state-of-the-art baselines on SafeWatch-Bench and on existing benchmarks, with better explanations and lower inference cost, suggesting a practical path toward robust, transparent video moderation.

Abstract

With the rise of generative AI and the rapid growth of high-quality video generation, video guardrails have become more crucial than ever for ensuring safety and security across platforms. Current video guardrails, however, fall into two camps. Some are overly simplistic: they rely on pure classification models trained on simple policies with limited unsafe categories and provide no detailed explanations. Others prompt multimodal large language models (MLLMs) with long safety guidelines, which is inefficient and impractical for guardrailing real-world content. To bridge this gap, we propose SafeWatch, an efficient MLLM-based video guardrail model designed to follow customized safety policies and provide multi-label video guardrail outputs with content-specific explanations in a zero-shot manner. In particular, unlike traditional MLLM-based guardrails that encode all safety policies autoregressively, causing inefficiency and bias, SafeWatch encodes each policy chunk in parallel and eliminates their position bias, so that all policies are attended to simultaneously with equal importance. In addition, to improve efficiency and accuracy, SafeWatch incorporates a policy-aware visual token pruning algorithm that adaptively selects the most relevant video tokens for each policy and discards noisy or irrelevant information. This allows for more focused, policy-compliant guardrailing with significantly reduced computational overhead. Considering the limitations of existing video guardrail benchmarks, we propose SafeWatch-Bench, a large-scale video guardrail benchmark comprising over 2M videos spanning six safety categories and covering over 30 tasks, to ensure comprehensive coverage of potential safety scenarios. SafeWatch outperforms SOTA by 28.2% on SafeWatch-Bench and by 13.6% on existing benchmarks, cuts inference costs by 10%, and delivers top-tier explanations validated by LLM and human reviews.
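The idea of encoding policy chunks in parallel with no position bias can be illustrated with a small sketch. It is not the authors' implementation. The function names, the choice to restart every chunk at the same position offset, and the block-diagonal causal mask are all assumptions made for illustration, based only on the abstract's statement that policies are encoded in parallel and attended to with equal importance.

```python
from typing import List


def sequential_position_ids(chunk_lengths: List[int], start: int = 0) -> List[List[int]]:
    """Baseline: policies are concatenated, so later chunks receive larger positions."""
    ids, offset = [], start
    for n in chunk_lengths:
        ids.append(list(range(offset, offset + n)))
        offset += n
    return ids


def parallel_position_ids(chunk_lengths: List[int], start: int = 0) -> List[List[int]]:
    """Assumed scheme: every policy chunk restarts at the same position offset,
    so no policy is 'earlier' or 'later' than another."""
    return [list(range(start, start + n)) for n in chunk_lengths]


def block_diagonal_causal_mask(chunk_lengths: List[int]) -> List[List[bool]]:
    """Assumed attention mask: a policy token attends only to earlier tokens
    of its own chunk, which makes the chunks independent of each other."""
    total = sum(chunk_lengths)
    mask = [[False] * total for _ in range(total)]
    offset = 0
    for n in chunk_lengths:
        for i in range(offset, offset + n):
            for j in range(offset, i + 1):
                mask[i][j] = True
        offset += n
    return mask


if __name__ == "__main__":
    lengths = [3, 2]  # hypothetical token counts of two policy chunks
    print(sequential_position_ids(lengths))
    print(parallel_position_ids(lengths))
```

Under these assumptions, reordering the policies changes neither the position ids nor the attention pattern seen by any chunk, which is one way to read "equal importance".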

Paper Structure

This paper contains 40 sections, 12 equations, 21 figures, 22 tables, 1 algorithm.

Figures (21)

  • Figure 1: An overview of SafeWatch. During data curation (top), we annotate each video in SafeWatch-Bench with high-quality multi-label guardrail and explanation via a multi-agent propose-discuss consensus pipeline, i.e., we guide multiple MLLMs to iteratively improve their annotation for each video frame by reaching consensus with each other. During training (bottom-left), SafeWatch distills knowledge from SafeWatch-Bench via three consecutive training stages to improve 1) the overall guardrail performance, 2) the adaptability to visual token pruning, and 3) the quality of explanation, respectively. During inference (bottom-right), SafeWatch judges videos for safety alignment with a customized policy and provides a description, guardrail, and explanation.
  • Figure 2: SafeWatch-Bench dataset, with 2M videos in total, covers six comprehensive safety categories, where each is further divided into multiple fine-grained risk subcategories to address a wide range of safety scenarios. Notably, SafeWatch-Bench is split into the Real and GenAI subsets, which contain the challenging videos produced in real-world scenarios (left-side), and generative videos produced by SOTA GenAI models (right-side), respectively. Specifically, each instance is annotated with multi-label guardrail labels and in-depth explanations using our pipeline.
  • Figure 3: The decoding pipeline of SafeWatch. For the video input (left), SafeWatch uses a segmentation model to split the video into clips based on unsafe events, then samples frames from each event and encodes them into patch tokens. For the safety guidelines (right), SafeWatch encodes each policy in parallel with equivalent RoPE embeddings so that all policies are treated with equal importance. For each policy, it then computes a relevance score from the policy's cross-attention with the video tokens, keeps the Top-$k$ most informative tokens, and prunes the rest. Finally, these tokens are concatenated with the query for decoding.
  • Figure 4: Comparison of SafeWatch and GPT-4o across fine-grained scenarios in SafeWatch-Bench. We evaluate the average accuracy per subcategory. Hard Benign refers to challenging benign samples that previous models often misclassify as harmful, resulting in high false positives.
  • Figure 5: Performance and inference cost of SafeWatch compared with the SFT baseline and GPT-4o under different pruning ratios (left), and generalizability to new policies together with the additional inference cost as the number of few-shot examples varies (right). Performance is measured by average accuracy, and inference cost by average time per video.
  • ...and 16 more figures
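
The per-policy pruning step described in the Figure 3 caption (score video tokens by their cross-attention with a policy, keep the Top-$k$, prune the rest) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm. The scaled dot-product scoring, the averaging over policy tokens, and the names `relevance_scores` and `prune_top_k` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_scores(policy_tokens: List[Vector], video_tokens: List[Vector]) -> List[float]:
    """Assumed relevance: mean attention weight each video token receives
    from the tokens of one policy (scaled dot-product attention)."""
    d = len(video_tokens[0])
    scores = [0.0] * len(video_tokens)
    for q in policy_tokens:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in video_tokens]
        for i, weight in enumerate(softmax(logits)):
            scores[i] += weight / len(policy_tokens)
    return scores


def prune_top_k(scores: List[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring video tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings: one policy token, four video tokens.
    policy = [[1.0, 0.0]]
    video = [[0.0, 1.0], [5.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
    kept = prune_top_k(relevance_scores(policy, video), k=2)
    print(kept)  # indices of the video tokens kept for this policy
```

Keeping the surviving indices in their original order preserves the temporal order of the video tokens before they are concatenated with the query.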