Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa

Abstract

Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.

Paper Structure (63 sections, 3 theorems, 15 equations, 20 figures, 13 tables, 6 algorithms)


Key Result

Lemma 1

Let $U_d$ be uniform on $\mathbb{S}^{d-1}$ and fix $k\in\mathbb N$. Then

Figures (20)

  • Figure 1: Spatio-temporal likelihoods per video. Blue: real; red: fake (ComGenVid). Joint spatio-temporal likelihoods clearly separate real and fake videos; examples illustrate high/low spatial likelihood (frame realism) and temporal likelihood (motion naturalness).
  • Figure 2: Qualitative comparison of ZED, D3, and our method (STALL). Each row shows a video clip with natural or unnatural spatial and temporal behavior, together with the corresponding predictions. ZED (spatial-only) misses cases dominated by temporal inconsistency; D3 (temporal-only) fails when spatial realism is misleading. STALL fuses spatial and temporal likelihoods, yielding robust detection when either modality alone is insufficient. Additional examples with more details are given in Supp. D.7.
  • Figure 3: Method overview. A video is split into frames and encoded into embeddings. The spatial branch scores the likelihood of each frame embedding; the temporal branch normalizes inter-frame differences and scores their likelihood analogously. The two scores are then fused into a unified measure that separates AI-generated from real videos.
  • Figure 4: Correlations among spatial and temporal aggregation methods. Values computed on VATEX [wang2019vatex], which is not used in our evaluations. When all individual likelihood detectors perform reasonably well on the evaluated benchmarks (see Supp. Section D.2), lower correlations are desirable.
  • Figure 5: Temporal embedding coordinates. Raw coordinates of temporal embeddings (frame differences) are not Gaussian; after normalization, each coordinate is approximately Gaussian (full histogram comparison in Supp. B.2.1).
  • ...and 15 more figures
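
The pipeline described in the Figure 3 caption (frame embeddings, a spatial likelihood per frame, a likelihood over normalized inter-frame differences, then fusion) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the Gaussian density model, the mean aggregation over frames, the rescaling of differences to norm sqrt(d), and the percentile-rank fusion are all assumptions, and the function names (`fit_gaussian`, `gaussian_loglik`, `temporal_features`, `video_scores`, `fuse`) are hypothetical. The frame encoder is left out; `frames_emb` stands for an array of per-frame embeddings from whatever encoder is used.

```python
import numpy as np

def fit_gaussian(x, eps=1e-3):
    """Estimate mean, inverse covariance and log-determinant from real-data rows.

    eps adds a small ridge so the covariance stays invertible (an assumption).
    """
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + eps * np.eye(x.shape[1])
    _, logdet = np.linalg.slogdet(cov)
    return mean, np.linalg.inv(cov), logdet

def gaussian_loglik(x, mean, cov_inv, logdet):
    """Log-density of N(mean, cov) evaluated at each row of x."""
    d = x.shape[-1]
    diff = x - mean
    maha = np.einsum("...i,ij,...j->...", diff, cov_inv, diff)
    return -0.5 * (maha + logdet + d * np.log(2.0 * np.pi))

def temporal_features(frames_emb):
    """Inter-frame differences, each rescaled to have norm sqrt(d)."""
    diffs = np.diff(frames_emb, axis=0)
    d = diffs.shape[1]
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    return np.sqrt(d) * diffs / np.maximum(norms, 1e-12)

def video_scores(frames_emb, spatial_stats, temporal_stats):
    """Return (spatial, temporal) scores: mean log-likelihood over frames / differences."""
    s = gaussian_loglik(frames_emb, *spatial_stats).mean()
    t = gaussian_loglik(temporal_features(frames_emb), *temporal_stats).mean()
    return s, t

def fuse(s, t, ref_s, ref_t):
    """Hypothetical fusion: average percentile rank against real reference scores."""
    ps = np.mean(ref_s <= s)
    pt = np.mean(ref_t <= t)
    return 0.5 * (ps + pt)
```

In this sketch, `spatial_stats` and `temporal_stats` would be fitted once with `fit_gaussian` on embeddings (and normalized differences) of real videos only, which matches the training-free, real-data-only setting the abstract describes. A low fused value then marks a video as less typical of real data.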

Theorems & Definitions (3)

  • Lemma 1: Maxwell-Poincaré [diaconis1984asymptotics]
  • Theorem 2: Maxwell-Poincaré convergence rate [diaconis1987dozen]
  • Lemma 3: Maxwell–Poincaré with norm concentration
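
The classical Maxwell-Poincaré result, which these statements are named after, says that for $U_d$ uniform on $\mathbb{S}^{d-1}$ the first $k$ coordinates of $\sqrt{d}\,U_d$ are approximately standard Gaussian when $d$ is large. The paper's exact formulation may differ; the following is only a numerical sanity check of the classical statement, with hypothetical helper names (`sample_sphere`, `scaled_first_coords`, `excess_kurtosis`). It samples the sphere by normalizing Gaussian vectors and reports the variance and excess kurtosis of the scaled coordinates, both of which should approach the Gaussian values (1 and 0) as $d$ grows.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere S^{d-1} (normalized Gaussians)."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def scaled_first_coords(n, d, k, rng):
    """First k coordinates of sqrt(d) * U_d for n independent draws."""
    return np.sqrt(d) * sample_sphere(n, d, rng)[:, :k]

def excess_kurtosis(x):
    """Fourth moment over squared second moment, minus 3 (0 for a centered Gaussian)."""
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
for d in (4, 32, 256):
    x = scaled_first_coords(20000, d, k=2, rng=rng)
    # Print dimension, per-coordinate variance, and excess kurtosis of coordinate 0.
    print(d, x.var(axis=0).round(3), round(float(excess_kurtosis(x[:, 0])), 3))
```

For small $d$ the scaled coordinate is bounded and visibly non-Gaussian (negative excess kurtosis), while for larger $d$ the statistic moves toward zero, which is the behavior the lemma describes.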