DBINDS -- Can Initial Noise from Diffusion Model Inversion Help Reveal AI-Generated Videos?

Yanlin Wu, Xiaogang Yuan, Dezhi An

TL;DR

This work addresses the poor cross-generator generalization of AI-generated video detectors by shifting detection from pixel-domain cues to latent-space dynamics. It introduces DBINDS, a diffusion-model-inversion-based detector that builds an Initial Noise Difference Sequence (INDS) from per-frame inversions and extracts multi-domain features to distinguish real from generated content. A LightGBM classifier, with hyperparameters tuned by Bayesian (TPE) search, fuses features from the spatiotemporal, frequency, statistical, and texture domains and achieves strong cross-generator generalization under a few-shot training regime on GenVidBench. The approach offers a latent-space perspective that complements existing detectors and is presented with practical deployment in mind, pointing toward scalable, resource-efficient forensics for AI-generated video.

Abstract

AI-generated video has advanced rapidly and poses serious challenges to content security and forensic analysis. Existing detectors rely mainly on pixel-level visual cues and generalize poorly to unseen generators. We propose DBINDS, a diffusion-model-inversion-based detector that analyzes latent-space dynamics rather than pixels. We find that the initial noise sequences recovered by diffusion inversion differ systematically between real and generated videos. Building on this, DBINDS forms an Initial Noise Difference Sequence (INDS) and extracts multi-domain, multi-scale features from it. With feature optimization and a LightGBM classifier tuned by Bayesian search, DBINDS (trained on a single generator) achieves strong cross-generator performance on GenVidBench, demonstrating good generalization and robustness in limited-data settings.
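To make the INDS idea concrete, here is a minimal sketch in plain Python. It assumes the sequence is the first-order difference between the initial noise recovered for consecutive frames; the abstract does not give the exact definition, so this form is an assumption. The names `noise_difference_sequence` and `mean_abs_change` are hypothetical, and the toy `frames` list stands in for real per-frame inversion output.

```python
from typing import List, Sequence

Noise = List[float]  # one frame's recovered initial noise, flattened to 1-D


def noise_difference_sequence(noises: Sequence[Noise]) -> List[Noise]:
    """Return d_t = z_{t+1} - z_t for consecutive frames (assumed INDS form)."""
    if len(noises) < 2:
        raise ValueError("need at least two frames to form a difference")
    size = len(noises[0])
    if any(len(z) != size for z in noises):
        raise ValueError("all frames must have the same noise size")
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(noises, noises[1:])
    ]


def mean_abs_change(inds: Sequence[Noise]) -> List[float]:
    """One scalar per time step: the average magnitude of the noise change."""
    return [sum(abs(v) for v in d) / len(d) for d in inds]


# Toy stand-in for per-frame inversion output (3 frames, 4 values each).
frames = [
    [0.0, 1.0, -1.0, 0.5],
    [0.5, 1.0, -2.0, 0.5],
    [0.5, 3.0, -2.0, 0.5],
]
inds = noise_difference_sequence(frames)
profile = mean_abs_change(inds)
```

A sequence of T frames yields T - 1 difference steps. The per-step profile is only one illustrative statistic and is not claimed to be one of the paper's features.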

Paper Structure

This paper contains 31 sections, 12 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Limitations of current generators
  • Figure 2: INDS feature‑analysis pipeline in DBINDS. The pipeline first performs data preprocessing (cropping) and VAE encoding. In latent space, O‑BELM–based inversion is applied and difference calculation is conducted to derive the Initial Noise Difference Sequence (INDS). From INDS, features are extracted in five domains in parallel: spatiotemporal features, frequency‑domain features, statistical‑domain features, texture‑and‑dimensionality‑reduction features, and channel‑interaction features. An intelligent screening module then applies variance‑threshold filtering and random‑forest–based importance screening, with a generalization‑evaluation loop; the retained features are further combined via a fixed feature‑selection step and a cross‑enhancement module to form more discriminative representations. Finally, the selected features are fed into a cost‑sensitive (biased) LightGBM classifier to decide whether the video is real or AI‑generated.
  • Figure 3: PCA visualization of the sample distribution under the full feature combination. (a) PCA of the training scenario (Vript vs. Pika). (b) Feature distributions across multiple generators.
  • Figure 4: Trends in Model Performance under Various Perturbation Conditions
  • Figure 5: PCA visualization of how the distribution under the full feature combination changes across inversion steps. The first row shows scatter plots of the training samples, Pika and Vript (real videos); the second row shows scatter plots of Vript and the seven unseen generated samples in the test set.
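
The variance-threshold filtering named in the Figure 2 caption can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function name `variance_threshold_filter`, the threshold of 0.01, and the toy feature matrix are all assumptions. The random-forest importance screening that follows it in the pipeline is not shown.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple


def variance_threshold_filter(
    rows: Sequence[Sequence[float]], threshold: float = 0.0
) -> Tuple[List[int], List[List[float]]]:
    """Keep feature columns whose population variance exceeds `threshold`."""
    if not rows:
        raise ValueError("need at least one sample")
    columns = list(zip(*rows))  # transpose: one tuple per feature
    kept = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[j] for j in kept] for row in rows]
    return kept, reduced


# Toy feature matrix: 4 videos x 3 features (values are illustrative only).
features = [
    [1.0, 0.20, 5.0],
    [1.0, 0.21, 7.0],
    [1.0, 0.20, 6.0],
    [1.0, 0.21, 8.0],
]
# Hypothetical threshold; column 0 is constant, column 1 barely varies.
kept, reduced = variance_threshold_filter(features, threshold=0.01)
```

Near-constant features carry little information for separating real from generated videos, so removing them first makes the later importance-based screening cheaper. In practice a library routine such as scikit-learn's `VarianceThreshold` would likely be used instead of hand-written code.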