DBINDS -- Can Initial Noise from Diffusion Model Inversion Help Reveal AI-Generated Videos?
Yanlin Wu, Xiaogang Yuan, Dezhi An
TL;DR
This work addresses the poor cross-generator generalization of AI-generated video detectors by shifting detection from pixel-domain cues to latent-space dynamics. It introduces DBINDS, a detector based on diffusion-model inversion that builds an Initial Noise Difference Sequence (INDS) from per-frame inversions and extracts multi-domain features to distinguish real from generated content. A LightGBM classifier, with hyperparameters tuned by Bayesian optimization (TPE), fuses features from the spatiotemporal, frequency, statistical, and texture domains and achieves strong cross-generator generalization on GenVidBench under a few-shot training regime. The approach is reported to be robust and takes practical deployment into account, offering a latent-space perspective that complements existing detectors and supports scalable, resource-efficient forensics for AI-generated video.
Abstract
AI-generated video has advanced rapidly and poses serious challenges to content security and forensic analysis. Existing detectors rely mainly on pixel-level visual cues and generalize poorly to unseen generators. We propose DBINDS, a detector based on diffusion-model inversion that analyzes latent-space dynamics rather than pixels. We find that the initial noise sequences recovered by diffusion inversion differ systematically between real and generated videos. Building on this, DBINDS forms an Initial Noise Difference Sequence (INDS) and extracts multi-domain, multi-scale features from it. With feature optimization and a LightGBM classifier tuned by Bayesian search, DBINDS, trained on a single generator, achieves strong cross-generator performance on GenVidBench, demonstrating good generalization and robustness in limited-data settings.
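As a rough illustration of the pipeline described in the abstract, the sketch below turns a sequence of per-frame initial noise tensors into a difference sequence and a few summary features. The abstract does not define the INDS precisely, so the sketch assumes it is the element-wise difference between the noise of consecutive frames. The inversion step itself is not shown; the random array stands in for its output. The function names `noise_difference_sequence` and `summarize_inds`, and the five toy features, are hypothetical and are not the paper's feature set.

```python
import numpy as np


def noise_difference_sequence(noise_seq):
    """Frame-to-frame differences of per-frame initial noise.

    noise_seq: array of shape (T, C, H, W), assumed to hold the initial
    noise recovered by diffusion inversion for each of T frames.
    Returns an array of shape (T - 1, C, H, W).
    """
    noise_seq = np.asarray(noise_seq, dtype=np.float64)
    if noise_seq.ndim != 4 or noise_seq.shape[0] < 2:
        raise ValueError("expected shape (T, C, H, W) with T >= 2")
    return np.diff(noise_seq, axis=0)


def summarize_inds(inds):
    """Toy statistical, temporal, and frequency summaries (illustrative only).

    Assumes the spatial dimensions H and W are both at least 2.
    """
    # Temporal view: mean squared difference, one value per time step.
    step_energy = (inds ** 2).mean(axis=(1, 2, 3))

    # Frequency view: spatial power spectrum with the DC term shifted to the centre.
    power = np.abs(np.fft.fftshift(np.fft.fft2(inds), axes=(-2, -1))) ** 2
    h, w = inds.shape[-2:]
    cy, cx = h // 2, w // 2
    ry, rx = max(1, h // 4), max(1, w // 4)
    low = power[..., cy - ry:cy + ry, cx - rx:cx + rx].sum()
    total = power.sum()
    # Share of spectral energy outside the central low-frequency box.
    high_ratio = 1.0 - low / total if total > 0 else 0.0

    return {
        "mean": float(inds.mean()),
        "std": float(inds.std()),
        "step_energy_mean": float(step_energy.mean()),
        "step_energy_std": float(step_energy.std()),
        "high_freq_ratio": float(high_ratio),
    }


rng = np.random.default_rng(0)
noise = rng.standard_normal((5, 4, 8, 8))  # stand-in for inverted noise, 5 frames
features = summarize_inds(noise_difference_sequence(noise))
```

In a full system, one such feature vector per video would be passed to a gradient-boosted classifier such as LightGBM; that stage, and the richer multi-scale and texture features the abstract mentions, are left out here.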
