Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation

Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, Luisa Verdoliva

TL;DR

The paper tackles the generalization gap in AI-generated video detection by steering models toward low-level forensic traces introduced by the generation process rather than model-specific semantic cues. It introduces two training-time augmentations: artificial fake videos, produced by reconstructing real content with a video autoencoder so that they carry generation artifacts without semantic bias, and wavelet-band replacement (WaveRep), which swaps selected low-frequency bands of a fake video with their real counterparts to force reliance on mid-to-high diagonal frequencies. Trained on data from a single generator, the approach achieves state-of-the-art generalization across 16 generators, with an average balanced accuracy of around 94% and strong results on recent models such as NOVA and FLUX. The findings favor data-centric priors and simple augmentation over architectural complexity for practical forensic detectors.

Abstract

Synthetic video generation is progressing very rapidly. The latest models can produce very realistic high-resolution videos that are virtually indistinguishable from real ones. Although several video forensic detectors have been recently proposed, they often exhibit poor generalization, which limits their applicability in a real-world scenario. Our key insight to overcome this issue is to guide the detector towards *seeing what really matters*. In fact, a well-designed forensic classifier should focus on identifying intrinsic low-level artifacts introduced by a generative architecture rather than relying on high-level semantic flaws that characterize a specific model. In this work, first, we study different generative architectures, searching and identifying discriminative features that are unbiased, robust to impairments, and shared across models. Then, we introduce a novel forensic-oriented data augmentation strategy based on the wavelet decomposition and replace specific frequency-related bands to drive the model to exploit more relevant forensic cues. Our novel training paradigm improves the generalizability of AI-generated video detectors, without the need for complex algorithms and large datasets that include multiple synthetic generators. To evaluate our approach, we train the detector using data from a single generative model and test it against videos produced by a wide range of other models. Despite its simplicity, our method achieves a significant accuracy improvement over state-of-the-art detectors and obtains excellent results even on very recent generative models, such as NOVA and FLUX.
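
The band replacement described in the abstract can be illustrated with a minimal sketch. It assumes a single-level 2D Haar transform applied frame by frame, which is a simplification chosen for brevity; the wavelet family and number of decomposition levels used by the authors are not stated here. The function names (`haar_dwt2`, `haar_idwt2`, `replace_bands`) and the band labels are hypothetical, introduced only for this example.

```python
import numpy as np

BANDS = ("ll", "lh", "hl", "hh")  # illustrative labels, not the paper's notation


def haar_dwt2(x):
    """Single-level 2D Haar transform over the last two axes (even sizes)."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # local average (baseband)
    lh = (a + b - c - d) / 2.0  # difference between adjacent rows
    hl = (a - b + c - d) / 2.0  # difference between adjacent columns
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = ll.shape[-2:]
    out = np.empty(ll.shape[:-2] + (2 * h, 2 * w), dtype=np.float64)
    out[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[..., 0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    out[..., 1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    out[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out


def replace_bands(fake, real, bands=("ll",)):
    """Return `fake` with the listed subbands taken from `real`.

    The diagonal band "hh" is never replaced, so whatever separates the
    result from `real` survives there.
    """
    if "hh" in bands:
        raise ValueError("the diagonal band must stay fake")
    f = dict(zip(BANDS, haar_dwt2(np.asarray(fake, dtype=np.float64))))
    r = dict(zip(BANDS, haar_dwt2(np.asarray(real, dtype=np.float64))))
    mixed = [r[name] if name in bands else f[name] for name in BANDS]
    return haar_idwt2(*mixed)


# Toy usage: `fake` stands in for an autoencoder reconstruction of `real`.
rng = np.random.default_rng(0)
real = rng.random((4, 8, 8))  # (frames, height, width)
fake = real + 0.05 * rng.standard_normal(real.shape)
augmented = replace_bands(fake, real, bands=("ll", "lh"))
```

In a training loop, the set of replaced bands would presumably be sampled at random per example, so that the detector cannot rely on any single low-frequency band.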

Paper Structure

This paper contains 17 sections, 7 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Synthetic video generators leave distinct traces that can be observed in the frequency spectrum. We leverage this observation to enhance the generalizability of detectors. To this end, we propose a novel training-time data augmentation strategy based on wavelet bands that forces the model to learn the frequency components that best distinguish real from synthetic content. Fakes are also generated through video autoencoding to avoid semantic bias and to push the model to exploit the low-level forensic traces left by modern video generation architectures. Our training paradigm improves the generalizability of the detector without the need for complex algorithms and large datasets that include multiple generators.
  • Figure 2: From left to right: a real video from the dataset proposed in [chen2024panda] and synthetic videos generated from its associated prompt "A scuba diver in the ocean surrounded by fishes" with Pyramid Flow, Mochi-1, Allegro and NOVA. For each of them we show its spatial power spectrum $S_{yx}(u,v)$ (bottom-right), and its temporal-spatial power spectra $S_{tx}(w,v)$ and $S_{yt}(u,w)$ (top-right and bottom-left).
  • Figure 3: Top: spatial and temporal-spatial power spectra of OpenSora-Plan videos before and after compression, compared to those computed from a real (compressed) video. Bottom: close-up of the power spectra shown at the top. Fourier-domain peaks due to video synthesis (forensic artifacts) are highlighted by circles; peaks caused by compression are highlighted by red boxes. After compression, most of the synthesis peaks are attenuated, and compression traces (peaks concentrated along the horizontal and vertical directions) become visible in the synthetic videos, as in the real ones.
  • Figure 4: For each real video, four corresponding fake versions are generated: one with all fake subbands, and others with selective replacements in which the baseband or specific subbands are substituted with their real counterparts. Notably, the diagonal mid-to-high frequency subbands are never replaced: this teaches the detector to do without the traces carried by one or another of the low-frequency subbands and to focus instead on the traces in the diagonal mid-to-high frequency subbands.
  • Figure 5: Comparison with SoTA methods on 16 generative models across different evaluation metrics.
  • ...and 6 more figures
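
As a rough illustration of the three spectra named in the Figure 2 caption, the sketch below computes them from a grayscale video with NumPy. It assumes that each 2D spectrum is obtained by averaging the 3D power spectrum over the remaining frequency axis; the captions do not say how the authors compute them, so this is one plausible reading rather than the paper's procedure. The function name `power_spectra` is hypothetical.

```python
import numpy as np


def power_spectra(video):
    """Return (S_yx, S_tx, S_yt) for a grayscale video of shape (T, H, W).

    Frequency variables follow the caption: w pairs with t, u with y,
    v with x. Each output averages the 3D power spectrum over the
    frequency axis it omits (an assumption made for this sketch).
    """
    video = np.asarray(video, dtype=np.float64)
    video = video - video.mean()  # drop the DC term so it does not dominate
    # Centered 3D power spectrum, axes ordered (w, u, v).
    spec = np.abs(np.fft.fftshift(np.fft.fftn(video))) ** 2
    s_yx = spec.mean(axis=0)    # indexed (u, v)
    s_tx = spec.mean(axis=1)    # indexed (w, v)
    s_yt = spec.mean(axis=2).T  # indexed (u, w)
    return s_yx, s_tx, s_yt


# Toy usage: a pattern that is periodic along x yields two symmetric peaks
# on the v axis of S_yx, analogous to the isolated peaks discussed in Figure 3.
t, h, w = 8, 16, 16
x = np.arange(w)
video = np.broadcast_to(np.cos(2 * np.pi * 4 * x / w), (t, h, w))
s_yx, s_tx, s_yt = power_spectra(video)
peak_u, peak_v = np.unravel_index(np.argmax(s_yx), s_yx.shape)
```

Spectra of this kind are usually viewed on a logarithmic scale, since the peaks can be many orders of magnitude above the background.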