AI-Generated Video Detection via Perceptual Straightening

Christian Internò, Robert Geirhos, Markus Olhofer, Sunny Liu, Barbara Hammer, David Klindt

TL;DR

This work tackles the rising challenge of AI-generated videos by proposing ReStraV, a fast, geometry-based detector that leverages perceptual straightening in neural representations. By analyzing temporal curvature and stepwise distance in DINOv2 embeddings over short frame windows, ReStraV captures discriminative differences between real and AI-created videos with a lightweight 21-feature classifier achieving near-SoTA performance and sub-50 ms latency. The key contributions include formalizing curvature-based representation signals, constructing robust trajectory statistics, and demonstrating strong generalization across generators, tasks, and even zero-shot scenarios on diverse benchmarks. The practical impact is a scalable, interpretable tool for content authentication that complements other detectors while offering neuroscience-inspired insights into how natural dynamics are encoded in neural representations.

Abstract

The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and with capturing subtle temporal inconsistencies. We propose ReStraV (Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the "perceptual straightening" hypothesis -- which suggests that real-world video trajectories become straighter in the neural representation domain -- we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
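The two trajectory measures named in the abstract can be illustrated with a minimal NumPy sketch. It assumes the per-frame embeddings $z_i$ are already available as a `(T, D)` array (no DINOv2 model is loaded here), and it uses the usual perceptual-straightening definitions: stepwise distance $d_i = \lVert z_{i+1} - z_i \rVert$ and curvature $\theta_i$ as the angle between consecutive displacement vectors. The function names `trajectory_features` and `summarize` are hypothetical, and the four summary statistics follow the Figure 5 caption; this is not the paper's full 21-feature set or its classifier.

```python
import numpy as np


def trajectory_features(z, eps=1e-12):
    """Stepwise distances and curvature angles for one embedding trajectory.

    z: array of shape (T, D), one embedding per frame, with T >= 3.
    Returns (dist, theta_deg) with shapes (T-1,) and (T-2,).
    Assumption: curvature is the angle between consecutive displacements.
    """
    z = np.asarray(z, dtype=np.float64)
    if z.ndim != 2 or z.shape[0] < 3:
        raise ValueError("need a (T, D) array with at least 3 frames")

    diffs = z[1:] - z[:-1]                 # displacement between frames
    dist = np.linalg.norm(diffs, axis=1)   # stepwise distance d_i

    # Normalize displacements; eps only guards against division by zero.
    # Repeated (identical) frames are an edge case this sketch does not
    # treat specially.
    unit = diffs / np.maximum(dist, eps)[:, None]

    # Cosine of the angle between consecutive unit displacements.
    cos = np.sum(unit[:-1] * unit[1:], axis=1)
    theta_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return dist, theta_deg


def summarize(x):
    """Aggregate a per-step signal into mean, min, max, and variance."""
    x = np.asarray(x, dtype=np.float64)
    return {
        "mean": float(np.mean(x)),
        "min": float(np.min(x)),
        "max": float(np.max(x)),
        "var": float(np.var(x)),
    }


if __name__ == "__main__":
    # Toy trajectories standing in for real embeddings (assumption).
    straight = np.arange(8)[:, None] * np.ones((1, 4))
    steps = np.array([[1.0, 0.0], [0.0, 1.0]] * 4)   # alternate two axes
    zigzag = np.vstack([np.zeros((1, 2)), np.cumsum(steps, axis=0)])

    for name, traj in [("straight", straight), ("zigzag", zigzag)]:
        d, th = trajectory_features(traj)
        print(name, summarize(d), summarize(th))
```

In this toy setup the straight trajectory has zero curvature at every step, while the zigzag turns by a right angle at every step; a downstream classifier would consume such per-video summaries as features.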

Paper Structure

This paper contains 30 sections, 1 equation, 18 figures, 5 tables.

Figures (18)

  • Figure 1: The ReStraV method for AI-video detection. Inspired by "perceptual straightening," our approach leverages the geometric insight that natural videos form "straighter" feature trajectories ($z_i$) than generated ones. The temporal curvature (Eq. \ref{eq:curvature_annot}) serves as the discriminative signal for detection.
  • Figure 2: (A) In pixel space (left), video trajectory metrics (curvature, distance; see Eq. \ref{eq:curvature_annot} for details) for natural and AI-generated videos show substantial overlap. In contrast, DINOv2 representations (right) straighten natural trajectories, clearly separating natural and AI-generated videos. (B) The mean curvature gap ($\Delta\theta$) between AI-generated and natural videos across various visual encoders. HVS-inspired models (red) exhibit negative deltas, straightening both natural and AI videos equally, while SSL models (green), particularly DINOv2, show the largest positive deltas.
  • Figure 3: ReStraV vs. VideoSwin liu2023tallswin for fake video detection on VidProM wang2024vidprommillionscalerealpromptgallery. "Seen generators" are those included in training; "Unseen generators" and "Future generators" were excluded from training. $\uparrow$ is better.
  • Figure 4: t-SNE embeddings of curvature trajectories for 1,000 videos from the VidProM dataset wang2024vidprommillionscalerealpromptgallery: 500 natural and 500 AI-generated (125 each from Pika pika, VideoCrafter2 DBLP4, Text2Video-Zero singer2022makeavideotexttovideogenerationtextvideo, and ModelScope wang2023modelscopetexttovideotechnicalreport; 24 frames/video). Left (Pixel Space): Natural and synthetic trajectories overlap significantly. Right (DINOv2 ViT-S/14 Representation Space): Trajectories clearly separate, with natural (blue) and AI-generated (shades of red) videos forming distinct clusters.
  • Figure 5: Distributions of aggregated temporal trajectory features (mean, min, max, variance) for natural and AI-generated videos, computed using DINOv2 ViT-S/14 representations. Top row: Temporal distance-based features ($d_i$). Bottom row: Corresponding curvature-based features ($\theta_i^\circ$). Both distance- and curvature-based features provide discriminative signal.
  • ...and 13 more figures