Towards Understanding Unsafe Video Generation

Yan Pang, Aiping Xiong, Yang Zhang, Tianhao Wang

TL;DR

This paper exposes safety risks in video diffusion models by generating an unsafe-video dataset from prompts sourced on 4chan and Lexica, classifying videos into five categories (Distorted/Weird, Terrifying, Pornographic, Violent/Bloody, Political) through clustering and thematic coding, and validating unsafe labels via 403 online participants to yield 937 unsafe videos. It then introduces Latent Variable Defense (LVD), a model-read defense that monitors intermediate latent states during the DDIM sampling process to detect and stop unsafe content early, achieving approximately 0.90 detection accuracy with up to 10x speedups across three open-source SOTA VGMs. The approach demonstrates strong robustness, generalization to adversarial prompts and image-to-video tasks, and interoperability with existing model-free and model-write defenses, outperforming prior image-domain defenses in efficiency and safety preservation. The work provides a practical, scalable framework for mitigating unsafe video generation and highlights the need for continued safety integration as VGMs scale in capability and accessibility.

Abstract

Video generation models (VGMs) have demonstrated the capability to synthesize high-quality output. It is important to understand their potential to produce unsafe content, such as violent or terrifying videos. In this work, we provide a comprehensive understanding of unsafe video generation. First, to confirm that these models can indeed generate unsafe videos, we use unsafe content generation prompts collected from 4chan and Lexica with three open-source SOTA VGMs to generate unsafe videos. After filtering out duplicates and poorly generated content, we created an initial set of 2112 unsafe videos from an original pool of 5607 videos. Through clustering and thematic coding analysis of these generated videos, we identify 5 unsafe video categories: Distorted/Weird, Terrifying, Pornographic, Violent/Bloody, and Political. With IRB approval, we then recruit online participants to help label the generated videos. Based on the annotations submitted by 403 participants, we identified 937 unsafe videos from the initial video set. With the labeled information and the corresponding prompts, we created the first dataset of unsafe videos generated by VGMs. We then study possible defense mechanisms to prevent the generation of unsafe videos. Existing defense methods in image generation focus on filtering either the input prompt or the output results. We propose a new approach called Latent Variable Defense (LVD), which works within the model's internal sampling process. LVD can achieve 0.90 defense accuracy while reducing time and computing resources by 10x when sampling a large number of unsafe prompts.
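To make the idea of a defense that works inside the sampling process more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a deterministic denoising loop in which a detector inspects the first few intermediate latents and sampling is aborted once enough of them are flagged. The names `sample_with_latent_checks`, `denoise_step`, `is_unsafe`, `check_steps`, and `threshold` are hypothetical, and the latent is modeled as a plain list of floats rather than a real video latent.

```python
from typing import Callable, List, Optional, Tuple

Latent = List[float]


def sample_with_latent_checks(
    z_start: Latent,
    denoise_step: Callable[[Latent, int], Latent],  # hypothetical one-step denoiser
    is_unsafe: Callable[[Latent, int], bool],       # hypothetical per-step detector
    num_steps: int,
    check_steps: int,
    threshold: float,
) -> Tuple[Optional[Latent], int, float]:
    """Run the denoising loop, inspecting the first `check_steps` latents.

    Returns (final_latent_or_None, steps_actually_run, flagged_fraction).
    The latent is None when sampling was stopped early.
    """
    assert 1 <= check_steps <= num_steps
    z = z_start
    flags = 0
    for step in range(1, num_steps + 1):
        z = denoise_step(z, step)
        if step <= check_steps:
            flags += int(is_unsafe(z, step))
        if step == check_steps:
            flagged_fraction = flags / check_steps
            # Assumed decision rule: stop once the flagged fraction
            # reaches the threshold, skipping the remaining steps.
            if flagged_fraction >= threshold:
                return None, step, flagged_fraction
    return z, num_steps, flags / check_steps
```

In this toy setting, a prompt flagged after 5 of 50 steps costs a tenth of a full sampling run, which illustrates how early stopping can save compute. The actual detector design and decision rule are described in the paper's methodology.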

Paper Structure (53 sections, 7 equations, 9 figures, 11 tables, 1 algorithm)

This paper contains 53 sections, 7 equations, 9 figures, 11 tables, 1 algorithm.

Figures (9)

  • Figure 1: Unlike previous model-free defense methods for image diffusion models, we propose exploiting the deterministic behavior of the DDIM sampler and using the intermediate denoising outputs to assess whether the generated video is unsafe. See the detailed description in \ref{sec:Methodology}.
  • Figure 2: Based on our thematic coding analysis, we identified five categories of unsafe videos from the generated videos. For each category, we selected the first frame of the representative videos to illustrate our findings. For the Pornographic videos, we add masks to cover the explicit sexual content.
  • Figure 3: AUC ROC scores for MagicTime \cite{yuan2024magictime}, AnimateDiff \cite{guo2023animatediff}, and VideoCrafter \cite{chen2024videocrafter2}. The parameters $\eta$ and $\lambda$ were selected based on the highlighted configurations in \ref{tab:evaluation} (i.e., $\eta=5$ and $\lambda=1$ for MagicTime, $\eta=10$ and $\lambda=1$ for AnimateDiff, and $\eta=20$ and $\lambda=0.6$ for VideoCrafter). Note: The AUC ROC presented here is computed over the output of the entire $\mathsf{LVD}$. When $\eta$ is small, the $\mathsf{LVD}$'s output (pred_value) takes only a few distinct values (e.g., when $\eta=1$, pred_value $\in \{0,1\}$), so few usable thresholds are available and the ROC curve resembles a step function. Increasing $\eta$ provides more usable thresholds, which smooths the ROC curve.
  • Figure 4: Trends in TPR, TNR, and accuracy of $\mathsf{LVD}$ on MagicTime as $\eta$ increases under different $\lambda$ settings. When $\eta$ is small, we set $\lambda$ to $1$. As $\eta$ increases, a smaller $\lambda$ (i.e., $\lambda=0.3$) yields better detection results.
  • Figure 5: Accuracy, TPR, and TNR of $\mathsf{LVD}$ on AnimateDiff as $\eta$ increases under different $\lambda$ settings.
  • ...and 4 more figures
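
The note in the Figure 3 caption about step-shaped ROC curves can be illustrated with a small sketch. It assumes, consistent with the caption's example that pred_value is in {0, 1} when η = 1, that pred_value is the fraction of the η checked denoising steps flagged as unsafe. That definition and the helper name `distinct_pred_values` are assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def distinct_pred_values(eta: int) -> List[float]:
    """All values a flagged fraction k / eta can take for k = 0..eta.

    Assumption: pred_value is the share of the eta checked steps that
    a detector flags as unsafe, so it can take at most eta + 1 values.
    """
    return sorted({k / eta for k in range(eta + 1)})


for eta in (1, 5, 20):
    values = distinct_pred_values(eta)
    # Each distinct score value is a candidate ROC threshold, so more
    # values give a finer-grained, smoother-looking curve.
    print(f"eta={eta}: {len(values)} distinct pred_value levels")
```

Under this assumption, η = 1 leaves only two score levels and thus a single non-trivial operating point, while η = 20 gives up to 21 levels, which matches the caption's observation that larger η produces a smoother ROC curve.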