Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos

Qingyuan Liu, Pengyuan Shi, Yun-Yun Tsai, Chengzhi Mao, Junfeng Yang

TL;DR

This paper tackles the challenge of robustly detecting diffusion-generated videos across different toolchains. It introduces DIVID, a CNN+LSTM detector that fuses RGB frames with diffusion reconstruction error (DIRE) to capture temporal dynamics, and it analyzes the impact of diffusion sampling steps on detection signals. A new diffusion-video benchmark is built, including in-domain data from Stable Video Diffusion and out-domain data from SORA, Pika, and Gen-2. Experiments show that DIVID delivers strong in-domain accuracy and substantially improves out-domain generalization, underscoring the value of combining explicit diffusion knowledge with temporal modeling for video forensics.

Abstract

The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. Recent work on combating Deepfake videos has produced detectors that are highly accurate at identifying GAN-generated samples. However, the robustness of these detectors on diffusion-generated videos from video creation tools (e.g., SORA by OpenAI, Runway Gen-2, and Pika) remains unexplored. In this paper, we propose a novel framework for detecting videos synthesized by multiple state-of-the-art (SOTA) generative models, such as Stable Video Diffusion. We find that SOTA methods for detecting diffusion-generated images lack robustness when identifying diffusion-generated videos. Our analysis reveals that the effectiveness of these detectors diminishes on out-of-domain videos, primarily because they struggle to track temporal features and dynamic variations between frames. To address this challenge, we collect a new benchmark dataset of diffusion-generated videos using SOTA video creation tools. We extract a representation based on explicit knowledge from the diffusion model for video frames and train our detector with a CNN + LSTM architecture. The evaluation shows that our framework captures the temporal features between frames well, achieves 93.7% detection accuracy on in-domain videos, and improves accuracy on out-domain videos by up to 16 points.

Paper Structure (9 sections, 8 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Real video frames from YouTube and fake video frames from SORA by OpenAI. The explicit knowledge, DIRE, is calculated as the difference between the original input frame and the frame reconstructed by the diffusion model. The reconstructed frame of the SORA-generated video is visually close to the original input frame, whereas the reconstruction of the real YouTube video is not (e.g., the cat's face is distorted after reconstruction). This observation inspired us to leverage the DIRE information for training.
  • Figure 2: The flow of DIVID. In step 1, given a sequence of video frames, we first generate a reconstructed version of every frame using the diffusion model. We then calculate the DIRE values from each reconstructed frame and its corresponding input frame. In step 2, the CNN+LSTM detector is trained on sequences of DIRE values and the original RGB frames.
  • Figure 3: Analysis of diffusion steps and DDIM steps for DIVID. The left subfigure shows performance for diffusion steps from 1k to 10k, with the DDIM step fixed at 20. The right subfigure shows performance for DDIM steps from 5 to 50, with the diffusion step fixed at 10k.
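
The DIRE computation described in the captions for Figures 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes DIRE to be the per-pixel absolute difference between a frame and its reconstruction, and it replaces the diffusion-model reconstruction with a hypothetical stand-in, `toy_reconstruct` (a simple blur), so that the example runs without any model. The function names `dire` and `video_dire` are likewise illustrative.

```python
import numpy as np

def dire(frame, reconstruct):
    """Per-pixel absolute reconstruction error for a single frame.

    Assumption: DIRE is taken here as |frame - reconstruct(frame)|.
    """
    frame = np.asarray(frame, dtype=np.float32)
    recon = np.asarray(reconstruct(frame), dtype=np.float32)
    return np.abs(frame - recon)

def video_dire(frames, reconstruct):
    """Stack per-frame DIRE maps into one (T, H, W, C) array."""
    return np.stack([dire(f, reconstruct) for f in frames])

def toy_reconstruct(frame):
    # Hypothetical stand-in for the diffusion model's reconstruction:
    # a 3-tap horizontal blur. A real pipeline would invert the frame
    # with the diffusion model and then reconstruct it.
    left = np.roll(frame, 1, axis=1)
    right = np.roll(frame, -1, axis=1)
    return (left + frame + right) / 3.0

# Toy "video": 4 frames of 8x8 RGB noise with values in [0, 1).
rng = np.random.default_rng(0)
frames = rng.random((4, 8, 8, 3), dtype=np.float32)

dire_seq = video_dire(frames, toy_reconstruct)
print(dire_seq.shape)  # one DIRE map per frame, same shape as the video
```

In a detector along the lines of Figure 2, a sequence such as `dire_seq` would be passed to the CNN+LSTM together with the original RGB frames. How the two inputs are fused is not specified in this section, so the sketch stops at the DIRE sequence.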