From Evaluation to Defense: Advancing Safety in Video Large Language Models

Yiwei Sun, Peiqi Jiang, Chuanbin Liu, Luohao Lin, Zhiying Lu, Hongtao Xie

TL;DR

The paper addresses the overlooked safety risks of video-based LLMs by introducing VideoSafetyBench (VSB-77k), the first large-scale, culturally diverse benchmark for video safety, with evaluation (VSB-Eval) and post-training (VSB-R1-46k) datasets. It quantifies the safety degradation caused by video inputs and proposes a dual-stage defense, VideoSafety-R1, consisting of Alarm Token-Guided Safety Fine-Tuning (AT-SFT) and Safety-Guided GRPO reinforcement learning, achieving substantial gains on VSB-Eval-HH and strong generalization to image-safety benchmarks. The findings demonstrate that active, multi-modal safety reasoning can restore and enhance safety alignment in Video LLMs, marking a shift from passive harm detection to proactive safety reasoning in dynamic multimodal contexts. Overall, the work provides a foundational, scalable framework to study and improve safety in video LLMs with practical benchmarks and post-training methods.

Abstract

While the safety risks of image-based large language models have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To systematically study this problem, we introduce VideoSafetyBench (VSB-77k), the first large-scale, culturally diverse benchmark for Video LLM safety, which comprises 77,646 video-query pairs and spans 19 principal risk categories across 10 language communities. We reveal that integrating the video modality degrades safety performance by an average of 42.3%, exposing systemic risks in multimodal attack exploitation. To address this vulnerability, we propose VideoSafety-R1, a dual-stage framework that achieves unprecedented safety gains through two innovations: (1) Alarm Token-Guided Safety Fine-Tuning (AT-SFT) injects learnable alarm tokens into visual and textual sequences, enabling explicit harm perception across modalities via multitask objectives. (2) Safety-Guided GRPO then enhances defensive reasoning through dynamic policy optimization with rule-based rewards derived from dual-modality verification. Together, these components shift safety alignment from passive harm recognition to active reasoning. The resulting framework achieves a 65.1% improvement on VSB-Eval-HH, and improves by 59.1%, 44.3%, and 15.0% on the image safety datasets MMBench, VLGuard, and FigStep, respectively. Our code is available in the supplementary materials. Warning: This paper contains examples of harmful language and videos, and reader discretion is recommended.
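The abstract describes Safety-Guided GRPO only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It pairs a hypothetical rule-based reward (the function `rule_based_reward`, its arguments, and its weights are all assumptions made for illustration) with the group-relative advantage normalization used by GRPO, where each sampled response is scored against the mean and standard deviation of its own group. The use of the population standard deviation and the `eps` term are also assumptions.

```python
from statistics import mean, pstdev


def rule_based_reward(pred_video_harmful, pred_text_harmful,
                      gt_video_harmful, gt_text_harmful, refused):
    """Hypothetical reward; the weights are illustrative, not from the paper."""
    reward = 0.0
    # Dual-modality verification: credit each modality whose harm label
    # the model predicted correctly.
    reward += 0.5 * (pred_video_harmful == gt_video_harmful)
    reward += 0.5 * (pred_text_harmful == gt_text_harmful)
    # Assumed rule: the response should refuse iff either modality is harmful.
    should_refuse = gt_video_harmful or gt_text_harmful
    reward += 1.0 * (refused == should_refuse)
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt with a harmful video and a benign text query; four sampled responses.
group = [
    rule_based_reward(True, False, True, False, refused=True),    # fully correct
    rule_based_reward(False, False, True, False, refused=True),   # missed video harm
    rule_based_reward(True, False, True, False, refused=False),   # failed to refuse
    rule_based_reward(False, True, True, False, refused=False),   # all wrong
]
advantages = group_relative_advantages(group)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.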

Paper Structure

This paper contains 75 sections, 9 equations, 12 figures, 16 tables.

Figures (12)

  • Figure 1: Models' Defense Success Rate (DSR) across 19 subcategories. We measure the performance of 14 model variants on VSB-Eval-HH. A higher DSR indicates a safer model. Values in the figure range from 0% to 100%. Our VideoSafety-R1 attains the highest DSR on 18 subtypes and falls short of the best result only on the Alcohol, Tobacco and Drugs subtype.
  • Figure 2: Comparison of MLLM Safety Benchmarks. We compare safety datasets in terms of test size (#Scale) and category diversity (#Cata.). VSB-Eval demonstrates notable strengths in both dimensions and is uniquely tailored for evaluating Video LLMs. Furthermore, we offer a broader range of ablation subsets along with extensive training data.
  • Figure 4: Framework of VideoSafetyBench (left) and VideoSafety-R1 (right). (a) VideoSafetyBench: VSB-77k is generated through filtered video collection, multi-agent-based multimodal annotation, and template-driven query generation. (b) VideoSafety-R1: safety post-training with AT-SFT and Safety-Guided GRPO, implementing risk-aware response generation through VSB-R1-46k.
  • Figure 5: Effects of FPS on VideoLLaMA3.
  • Figure 6: Effects of FPS on Qwen2.5 VL.
  • ...and 7 more figures
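
Figure 1 reports Defense Success Rate (DSR) per subcategory. The exact definition is not given in this excerpt, so the sketch below assumes DSR is the percentage of harmful queries in a subcategory for which the model's response was judged a successful defense. The function name, the record layout, and the example counts are hypothetical.

```python
from collections import defaultdict


def defense_success_rate(records):
    """Return DSR in percent per subcategory.

    records: iterable of (subcategory, defended) pairs, where `defended`
    is True when the response was judged a successful defense (assumed
    to come from an external judge, not computed here).
    """
    totals = defaultdict(int)
    defended_counts = defaultdict(int)
    for subcategory, defended in records:
        totals[subcategory] += 1
        defended_counts[subcategory] += int(defended)
    return {c: 100.0 * defended_counts[c] / totals[c] for c in totals}


# Made-up judgments for two subcategories, for illustration only.
example = [
    ("Alcohol, Tobacco and Drugs", True),
    ("Alcohol, Tobacco and Drugs", False),
    ("Violence", True),
    ("Violence", True),
]
dsr = defense_success_rate(example)
```

Computing the rate per subcategory, rather than pooling all queries, keeps a weak category visible even when the overall average is high.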