Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, Mingkui Tan

TL;DR

This work tackles the detection of AI-generated videos by introducing a physics-informed spatiotemporal statistic, the Normalized Spatiotemporal Gradient (NSG), grounded in probability flow conservation. By estimating NSG with diffusion-model score functions and a brightness-constancy-inspired temporal term, the authors build NSG-VD, a detector that uses MMD on NSG features to discriminate real versus generated videos. They provide theoretical bounds showing that NSG features magnify distribution shifts in generated content, and demonstrate substantial empirical gains over state-of-the-art baselines across standard and challenging data-imbalance settings. The approach offers a robust, artifact-agnostic detection framework that leverages fundamental physical dynamics, with practical performance improvements and scalable kernel-based discrimination. The work also discusses limitations, real-time considerations, and avenues for future physics-augmented detection methods.

Abstract

AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.
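The detection metric described above, an MMD between the feature sets of a test video and of real reference videos, can be illustrated with a small kernel two-sample estimate. This is a minimal sketch only: the Gaussian kernel, the fixed `bandwidth`, the biased estimator, and the function names `gaussian_kernel` and `mmd2_biased` are assumptions made for illustration, and random vectors stand in for NSG features. It is not taken from the NSG-VD code.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances, clipped at 0 against round-off.
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq = np.maximum(sq, 0.0)
    return np.exp(-sq / (2.0 * bandwidth ** 2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    return float(kxx.mean() + kyy.mean() - 2.0 * kxy.mean())


# Usage with placeholder features (hypothetical stand-ins for NSG features).
rng = np.random.default_rng(0)
real_ref = rng.normal(size=(64, 4))            # reference set of "real" features
real_test = rng.normal(size=(64, 4))           # same distribution as the reference
shifted_test = rng.normal(size=(64, 4)) + 3.0  # distribution-shifted features

print("same distribution:", mmd2_biased(real_ref, real_test, bandwidth=2.0))
print("shifted          :", mmd2_biased(real_ref, shifted_test, bandwidth=2.0))
```

A larger value indicates a larger gap between the two feature distributions, so a detector of this kind would flag a test video when the statistic exceeds a chosen threshold. The threshold itself is a separate design choice that this sketch does not cover.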

Paper Structure

This paper contains 40 sections, 10 theorems, 89 equations, 20 figures, 9 tables, 2 algorithms.

Key Result

Proposition 1

Under the brightness constancy assumption $p(\mathbf{x}+\Delta\mathbf{x}, t+\Delta t) \approx p(\mathbf{x}, t)$ with small inter-frame motion ($\Delta t \to 0$) and inter-frame displacement ($\Delta \mathbf{x} \to 0$), we have
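The displayed result of the proposition is not reproduced in this listing. As a hedged sketch of what the stated assumption implies (a first-order derivation added here for illustration, not the paper's exact statement), expand the left-hand side of the brightness constancy relation to first order:

$$p(\mathbf{x}+\Delta\mathbf{x}, t+\Delta t) \approx p(\mathbf{x}, t) + \nabla_{\mathbf{x}} p(\mathbf{x}, t)^{\top} \Delta\mathbf{x} + \partial_t p(\mathbf{x}, t)\, \Delta t.$$

Setting this approximately equal to $p(\mathbf{x}, t)$ and dividing by $\Delta t$ gives

$$\partial_t p(\mathbf{x}, t) \approx -\nabla_{\mathbf{x}} p(\mathbf{x}, t)^{\top} \frac{\Delta\mathbf{x}}{\Delta t},$$

and, wherever $p(\mathbf{x}, t) > 0$, dividing both sides by $p(\mathbf{x}, t)$ yields the log-density form

$$\partial_t \log p(\mathbf{x}, t) \approx -\nabla_{\mathbf{x}} \log p(\mathbf{x}, t)^{\top} \frac{\Delta\mathbf{x}}{\Delta t}.$$

Under this reading, the temporal derivative can be approximated from the spatial score and the inter-frame displacement alone, without estimating motion separately.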

Figures (20)

  • Figure 1: Comparison of traditional and physics-driven paradigms for spatiotemporal modeling in AI-generated video detection. (a) Traditional methods [amerini2019deepfake, wang2023altfreezing, xu2023tall] often rely on specific artifacts such as appearance consistency and optical-flow-based motion modeling, and struggle with content that is highly realistic yet physically implausible (e.g., Sora). (b) Our physics-driven approach explicitly models video dynamics via physical conservation laws, effectively identifying violations of those laws.
  • Figure 2: Overview of the proposed NSG-VD. Given a reference set of real videos $\{\mathbf{x}^{re}\}$ and a test video $\mathbf{x}^{te}$, we estimate their spatial gradients $\nabla_{\mathbf{x}} \log p(\mathbf{x}, t)$ and temporal derivatives $\partial_t \log p(\mathbf{x}, t)$ via a pre-trained diffusion model $s_\theta$, from which we derive their Normalized Spatiotemporal Gradients (NSGs) and calculate the MMD between NSG features of real and test videos as a detection metric.
  • Figure 3: Impact of decision threshold.
  • Figure 4: Comparisons with baselines in terms of training costs and performance (%), where we train all models with $10{,}000$ real and generated videos from Kinetics-400 and Pika, respectively.
  • Figure 5: Distribution of the values of temporal derivatives $\partial_t \log p(\mathbf{x}, t)$ in the NSG statistic across $10{,}000$ real and generated videos from Kinetics-400 and SEINE, respectively.
  • ...and 15 more figures
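
The pipeline in the Figure 2 caption (spatial gradients and temporal derivatives from a score model, combined into NSG features) can be sketched roughly as follows. Everything specific here is an assumption for illustration: `toy_score` is the exact score of a standard normal density standing in for the pre-trained diffusion model $s_\theta$, the temporal derivative uses a first-order brightness-constancy-style estimate, and the sign convention and the `eps` guard in the ratio are placeholders rather than the paper's definition.

```python
import numpy as np


def toy_score(frame):
    # Hypothetical stand-in for a diffusion score model s_theta:
    # the score of a standard normal density is grad_x log p(x) = -x.
    return -frame


def nsg_features(video, score_fn=toy_score, dt=1.0, eps=1e-6):
    """Ratio of spatial score to (negated) temporal log-density change.

    video: array of shape (T, D), one flattened frame per row.
    Returns an array of shape (T - 1, D).
    """
    frames = np.asarray(video, dtype=float)
    # Spatial gradient estimate for every frame except the last.
    scores = np.stack([score_fn(f) for f in frames[:-1]])   # (T-1, D)
    # Finite-difference inter-frame displacement per unit time.
    velocity = (frames[1:] - frames[:-1]) / dt               # (T-1, D)
    # Assumed first-order estimate: d/dt log p ~ -score . velocity
    dlogp_dt = -np.sum(scores * velocity, axis=1, keepdims=True)
    # Assumed sign convention; near-zero denominators are replaced by eps.
    denom = -dlogp_dt
    denom = np.where(np.abs(denom) < eps, eps, denom)
    return scores / denom


# Usage on a tiny two-frame "video" with two pixels per frame.
video = np.array([[1.0, 0.0], [2.0, 0.0]])
print(nsg_features(video))
```

Features of this shape could then be compared between a test video and a reference set of real videos with a kernel two-sample statistic such as MMD.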

Theorems & Definitions (21)

  • Definition 1
  • Proposition 1
  • Proposition 2
  • Theorem 1
  • Theorem 2
  • proof
  • Corollary 1
  • proof
  • proof
  • Proposition 3
  • ...and 11 more