Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection
Xiaofeng Tan, Hongsong Wang, Xin Geng, Liang Wang
TL;DR
The paper tackles open-set skeleton-based video anomaly detection with a frequency-guided diffusion model with perturbation training (FG-Diff). Adversarial perturbations of normal samples broaden the range of normal motion patterns the model learns to reconstruct. During denoising, a 2D-DCT with DCT-Mask-based fusion shifts the reconstruction emphasis from fine-grained local details to the global motion structure. Across five benchmarks, the method reports state-of-the-art performance, with improved robustness to unseen normal motions and strong generalization across diverse motion contexts. Combining perturbation-based robustness with frequency-guided reconstruction makes anomaly detection more reliable in open-set, real-world VAD scenarios.
Abstract
Video anomaly detection (VAD) is a vital yet complex open-set task in computer vision, commonly tackled through reconstruction-based methods. However, these methods struggle with two key limitations: (1) insufficient robustness in open-set scenarios, where unseen normal motions are frequently misclassified as anomalies, and (2) an overemphasis on, but restricted capacity for, reconstructing local motion details, which are inherently difficult to capture accurately due to their diversity. To overcome these challenges, we introduce a novel frequency-guided diffusion model with perturbation training. First, we enhance robustness by training a generator to produce perturbed samples, which are similar to normal samples and target the weaknesses of the reconstruction model. This training paradigm expands the reconstruction domain of the model, improving its generalization to unseen normal motions. Second, to address the overemphasis on motion details, we employ the 2D Discrete Cosine Transform (DCT) to separate high-frequency (local) and low-frequency (global) motion components. By guiding the diffusion model with observed high-frequency information, we prioritize the reconstruction of low-frequency components, enabling more accurate and robust anomaly detection. Extensive experiments on five widely used VAD datasets demonstrate that our approach surpasses state-of-the-art methods, underscoring its effectiveness in open-set scenarios and diverse motion contexts. Our project website is https://xiaofeng-tan.github.io/projects/FG-Diff/index.html.
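The frequency separation described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the triangular mask shape, the `cutoff` parameter, and the (frames x joint coordinates) layout of the motion array are all assumptions made for illustration. The sketch applies an orthonormal 2D DCT to a motion sequence, splits it into low-frequency (global) and high-frequency (local) parts, and fuses the low-frequency coefficients of a generated sequence with the high-frequency coefficients of the observed one.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C, so that C @ x is the 1D DCT of x."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # rescale the DC row for orthonormality
    return c


def dct2(x: np.ndarray) -> np.ndarray:
    """2D DCT: transform along the rows axis, then along the columns axis."""
    return dct_matrix(x.shape[0]) @ x @ dct_matrix(x.shape[1]).T


def idct2(coeffs: np.ndarray) -> np.ndarray:
    """Inverse 2D DCT (the transpose, because the matrices are orthonormal)."""
    return dct_matrix(coeffs.shape[0]).T @ coeffs @ dct_matrix(coeffs.shape[1])


def low_frequency_mask(shape: tuple, cutoff: int) -> np.ndarray:
    """Hypothetical mask: True where the frequency indices satisfy u + v < cutoff."""
    u = np.arange(shape[0])[:, None]
    v = np.arange(shape[1])[None, :]
    return (u + v) < cutoff


def split_frequencies(motion: np.ndarray, cutoff: int):
    """Split a motion array into low-frequency and high-frequency parts."""
    coeffs = dct2(motion)
    mask = low_frequency_mask(motion.shape, cutoff)
    low = idct2(coeffs * mask)    # keep only low-frequency coefficients
    high = idct2(coeffs * ~mask)  # keep only high-frequency coefficients
    return low, high


def fuse(generated: np.ndarray, observed: np.ndarray, cutoff: int) -> np.ndarray:
    """Low frequencies from `generated`, high frequencies from `observed`."""
    mask = low_frequency_mask(generated.shape, cutoff)
    fused = np.where(mask, dct2(generated), dct2(observed))
    return idct2(fused)


# Toy usage: 16 frames, 17 joints with (x, y) coordinates flattened to 34 values.
rng = np.random.default_rng(0)
observed = rng.normal(size=(16, 34))
generated = rng.normal(size=(16, 34))
low, high = split_frequencies(observed, cutoff=8)
fused = fuse(generated, observed, cutoff=8)
```

Because the transform is linear and orthonormal, the two parts returned by `split_frequencies` sum back to the input, which makes the split easy to sanity-check. In an actual diffusion model a fusion of this kind would be applied to intermediate samples during denoising; how and at which steps the paper does this is not specified in the abstract.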
