Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

Omer Ben Hayun; Roy Betser; Meir Yossef Levi; Levi Kassel; Guy Gilboa

Abstract

Following major advances in text and image generation, video generation has surged, producing highly realistic and controllable sequences. This progress also raises serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.

Paper Structure (63 sections, 3 theorems, 15 equations, 20 figures, 13 tables, 6 algorithms)

Introduction
Background and Related work
Preliminaries
Whitening transform and Gaussian likelihood
Asymptotic Gaussian projections
Method: STALL
Spatial likelihood
Temporal likelihood
Unified score
Calibration set
Evaluations
Experimental settings
Results
Ablation study
Conclusion
...and 48 more sections

Key Result

Lemma 1

Let $U_d$ be uniform on $\mathbb{S}^{d-1}$ and fix $k\in\mathbb N$. Then

Figures (20)

Figure 1: Spatio-temporal likelihoods per video. Blue: real; red: fake (ComGenVid). Joint spatio-temporal likelihoods clearly separate real and fake videos; examples illustrate high/low spatial likelihood (frame realism) and temporal likelihood (motion naturalness).
Figure 2: Qualitative comparison of ZED, D3, and our method (STALL). Each row shows a video clip with natural or unnatural spatial and temporal behavior, together with the corresponding predictions. ZED (spatial-only) misses in cases dominated by temporal inconsistency; D3 (temporal-only) fails when spatial realism is misleading. STALL fuses spatial and temporal likelihoods, yielding robust detection when either modality alone is insufficient. Additional examples with more details are given in Supp. D.7.
Figure 3: Method overview. A video is split into frames and encoded into embeddings. The spatial branch scores the likelihood of each frame embedding; the temporal branch normalizes inter-frame differences and scores their likelihood analogously. The two scores are then fused into a unified measure that separates AI-generated from real videos.
Figure 4: Correlations among spatial and temporal aggregation methods. Values are computed on VATEX (Wang et al., 2019), which is not used in our evaluations. Provided that all individual likelihood detectors perform reasonably well on the evaluated benchmarks (see Supp. Section D.2), lower correlations are desirable.
Figure 5: Temporal embedding coordinates. Raw coordinates of temporal embeddings (frame differences) are not Gaussian; after normalization, each coordinate is approximately Gaussian (full histogram comparison in Supp. B.2.1).
...and 15 more figures
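
The pipeline described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes frame embeddings are already available as NumPy arrays (the encoder is left out), that "likelihood" means a Gaussian log-likelihood computed in whitened coordinates, that inter-frame differences are normalized by rescaling them to norm sqrt(d), and that the two scores are fused by a plain average. The names `fit_whitener`, `whitened_loglik`, `normalized_diffs`, `score_video` and `make_video` are hypothetical, and the "real" and "fake" videos are synthetic toy data, not results from the paper.

```python
import numpy as np

def fit_whitener(calib, eps=1e-6):
    """Estimate the mean and a ZCA whitening matrix from calibration rows (n, d)."""
    mu = calib.mean(axis=0)
    cov = np.cov(calib, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T  # symmetric
    return mu, W

def whitened_loglik(x, mu, W):
    """Gaussian log-likelihood of each row of x, up to an additive constant.

    Rows are whitened and scored under N(0, I); the log|det W| term is
    dropped because it is identical for every sample.
    """
    z = (x - mu) @ W
    d = x.shape[1]
    return -0.5 * (z ** 2).sum(axis=1) - 0.5 * d * np.log(2 * np.pi)

def normalized_diffs(frames):
    """Inter-frame differences rescaled to norm sqrt(d) (assumed normalization)."""
    diffs = np.diff(frames, axis=0)
    d = frames.shape[1]
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    return np.sqrt(d) * diffs / np.maximum(norms, 1e-12)

def score_video(frames, spatial_stats, temporal_stats):
    """Return (spatial, temporal, fused) scores; higher means more 'real-like'."""
    s = whitened_loglik(frames, *spatial_stats).mean()
    t = whitened_loglik(normalized_diffs(frames), *temporal_stats).mean()
    return s, t, 0.5 * (s + t)  # placeholder fusion: plain average

# --- Toy demonstration on synthetic embeddings (not real video data) ---
rng = np.random.default_rng(0)
D, T = 16, 8  # embedding dimension and frames per video (arbitrary)

def make_video(scale=1.0):
    """A random-walk 'video' of T embeddings around a random base point."""
    base = scale * rng.normal(size=D)
    steps = 0.1 * rng.normal(size=(T, D))
    return base + np.cumsum(steps, axis=0)

# The calibration set contains only 'real' videos.
calib = [make_video() for _ in range(200)]
spatial_stats = fit_whitener(np.concatenate(calib))
temporal_stats = fit_whitener(np.concatenate([normalized_diffs(v) for v in calib]))

real_scores = [score_video(make_video(), spatial_stats, temporal_stats)[2]
               for _ in range(50)]
# Toy 'fake' videos: same dynamics, but embeddings with inflated spread.
fake_scores = [score_video(make_video(scale=3.0), spatial_stats, temporal_stats)[2]
               for _ in range(50)]

print("mean fused score, real:", np.mean(real_scores))
print("mean fused score, fake:", np.mean(fake_scores))
```

In this toy setup only the spatial statistics of the "fake" videos are off, so the separation comes from the spatial branch. A case with unnatural motion but plausible frames would instead be caught by the temporal branch, which is the motivation for combining the two.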

Theorems & Definitions (3)

Lemma 1: Maxwell-Poincaré (Diaconis and Freedman, 1984)
Theorem 2: Maxwell-Poincaré convergence rate (Diaconis and Freedman, 1987)
Lemma 3: Maxwell-Poincaré with norm concentration
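
The three results above build on the classical Maxwell-Poincaré phenomenon, which is usually stated as follows: if $U_d$ is uniform on $\mathbb{S}^{d-1}$, the first $k$ coordinates of $\sqrt{d}\,U_d$ converge in distribution to a standard $k$-dimensional Gaussian as $d\to\infty$. The sketch below checks this numerically. It illustrates the classical statement rather than the paper's exact lemmas, and the values of `d`, `k` and `n` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 512, 3, 5000  # ambient dimension, coordinates kept, sample count

# Uniform points on the unit sphere S^{d-1}: normalize Gaussian vectors.
g = rng.normal(size=(n, d))
u = g / np.linalg.norm(g, axis=1, keepdims=True)

# First k coordinates, rescaled by sqrt(d).
proj = np.sqrt(d) * u[:, :k]

# Compare empirical moments with those of N(0, I_k).
mean = proj.mean(axis=0)
cov = np.cov(proj, rowvar=False)
kurtosis = ((proj - mean) ** 4).mean(axis=0) / proj.var(axis=0) ** 2

print("mean (expect ~0):", mean)
print("covariance (expect ~I):")
print(cov)
print("kurtosis (expect ~3):", kurtosis)
```

For large `d` the empirical mean, covariance and kurtosis should be close to 0, the identity and 3 respectively. Lowering `d` to a small value such as 4 makes the kurtosis drop visibly below 3, which shows that the Gaussian approximation is a high-dimensional effect.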
