
ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity

Hang Wang, Chao Shen, Lei Zhang, Zhi-Qi Cheng

Abstract

AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies and thus often fail to capture the underlying generative logic that governs global temporal evolution, which limits detection performance. In this paper, we identify a distinctive fingerprint of AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos, which exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., conditioned on text or image prompts), inducing unnaturally repetitive correlations across the visual and semantic domains. Building on this insight, we propose ATSS, a multimodal detection framework built on a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories from frame-wise descriptions and constructs visual, textual, and cross-modal similarity matrices that jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated through a bidirectional cross-attentive fusion module that models intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks (GenVideo, EvalCrafter, VideoPhy, and VidProM) demonstrate that ATSS significantly outperforms state-of-the-art methods in AP, AUC, and ACC, and generalizes well across diverse video generation models. Code and models will be released at https://github.com/hwang-cs-ime/ATSS.
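As a concrete illustration of the triple-similarity representation described above, the following minimal Python/PyTorch sketch computes visual, textual, and cross-modal similarity matrices from pre-extracted, frame-wise embeddings. This is not the authors' released code; the function name, embedding source (e.g., CLIP-style encoders), and tensor shapes are assumptions for illustration only.

```python
# Minimal sketch of the triple-similarity representation (hypothetical code,
# not the ATSS release). Assumes frame-level visual embeddings and per-frame
# caption embeddings have already been extracted, one vector per frame.
import torch
import torch.nn.functional as F

def triple_similarity_matrices(visual_feats: torch.Tensor,
                               text_feats: torch.Tensor):
    """
    visual_feats: (T, D) per-frame visual embeddings
    text_feats:   (T, D) embeddings of the frame-wise textual descriptions
    Returns three (T, T) matrices: visual, textual, and cross-modal similarity.
    """
    v = F.normalize(visual_feats, dim=-1)   # cosine similarity via unit vectors
    t = F.normalize(text_feats, dim=-1)
    sim_vv = v @ v.T                        # visual self-similarity
    sim_tt = t @ t.T                        # textual self-similarity
    sim_vt = v @ t.T                        # cross-modal frame-caption similarity
    return sim_vv, sim_tt, sim_vt

# Usage with random stand-ins for real frame/caption embeddings:
T, D = 16, 512
sim_vv, sim_tt, sim_vt = triple_similarity_matrices(torch.randn(T, D),
                                                     torch.randn(T, D))
print(sim_vv.shape, sim_tt.shape, sim_vt.shape)  # each torch.Size([16, 16])
```

Each of the three matrices is then treated as a temporal signal in its own right: per the abstract, AIGVs are expected to show denser, higher-magnitude off-diagonal structure than real videos.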

Figures (4)

  • Figure 1: Motivation of the proposed ATSS framework. Real videos are characterized by stochastic spatiotemporal dynamics, resulting in diffuse and low-magnitude self-similarity patterns. In contrast, AI-generated videos exhibit systemic temporal regularity caused by their anchor-driven generation, which yields denser and higher-intensity correlation matrices across multiple modalities. This anomalous self-similarity serves as a distinctive forensic fingerprint, enabling ATSS to effectively distinguish AIGVs from natural sequences.
  • Figure 2: The overall framework of ATSS. Given a video with $T$ sampled frames, an image captioning model is first employed to generate frame-wise textual descriptions. Visual and textual features are then extracted to construct visual, textual, and cross-modal self-similarity matrices. Each matrix is processed by a dedicated Transformer to capture temporal dynamics. Finally, a cross-attentive fusion module integrates these modality-specific cues into a unified representation for binary classification. (A minimal sketch of this fusion step follows the figure list.)
  • Figure 3: t-SNE visualizations of 10 subsets on the GenVideo dataset: (a) Crafter, (b) Gen2, (c) HotShot, (d) Lavie, (e) ModelScope, (f) MoonValley, (g) MorphStudio, (h) Show-1, (i) Sora, and (j) WildScrape.
  • Figure 4: Visualization of Attention Density Maps. Each row shows the attention weights of the visual, textual, and cross-modal branches for a real video and ten AI-generated samples, randomly selected from MSR-VTT and from each generator subset of GenVideo, respectively.
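To make the fusion step in Figure 2 more concrete, the sketch below shows one plausible form of a bidirectional cross-attentive fusion module: each modality branch attends to the other, the two attended sequences are pooled, and the concatenation feeds a binary classifier. Module names, the hidden dimension, the number of heads, and mean pooling are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of bidirectional cross-attentive fusion in the spirit
# of Figure 2 (not the ATSS release). Inputs are token sequences produced by
# two modality-specific Transformer branches over similarity-matrix features.
import torch
import torch.nn as nn

class BiCrossAttentiveFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 1)   # real vs. AI-generated logit

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor):
        # tokens_a, tokens_b: (B, T, dim) encoded tokens from two branches.
        fused_ab, _ = self.a_to_b(tokens_a, tokens_b, tokens_b)  # a queries b
        fused_ba, _ = self.b_to_a(tokens_b, tokens_a, tokens_a)  # b queries a
        pooled = torch.cat([fused_ab.mean(dim=1),
                            fused_ba.mean(dim=1)], dim=-1)       # (B, 2*dim)
        return self.classifier(pooled)                           # (B, 1)

# Usage with dummy branch outputs (batch of 2 videos, 16 tokens each):
fusion = BiCrossAttentiveFusion()
logit = fusion(torch.randn(2, 16, 256), torch.randn(2, 16, 256))
print(logit.shape)  # torch.Size([2, 1])
```

In a three-branch setup such as ATSS (visual, textual, cross-modal), the same pattern could be applied pairwise or with the cross-modal branch as a shared query stream; the paper's fusion design should be taken from the released code rather than this sketch.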