Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection

Xiaofeng Tan, Hongsong Wang, Xin Geng, Liang Wang

TL;DR

The paper tackles open-set skeleton-based video anomaly detection by introducing a frequency-guided diffusion model with perturbation training (FG-Diff). It expands the learning of normal motion patterns through adversarial perturbations and shifts the reconstruction emphasis from fine-grained local details to the global motion structure by leveraging 2D-DCT with a DCT-Mask-based fusion during denoising. Empirical results across five benchmarks show state-of-the-art performance, highlighting improved robustness to unseen normal motions and strong generalization in diverse motion contexts. The work provides practical benefits for real-world VAD by combining perturbation-based robustness with frequency-guided reconstruction, enabling more reliable anomaly detection in open-set scenarios.

Abstract

Video anomaly detection (VAD) is a vital yet complex open-set task in computer vision, commonly tackled through reconstruction-based methods. However, these methods struggle with two key limitations: (1) insufficient robustness in open-set scenarios, where unseen normal motions are frequently misclassified as anomalies, and (2) an overemphasis on, but restricted capacity for, reconstructing local motion details, which are inherently difficult to capture accurately due to their diversity. To overcome these challenges, we introduce a novel frequency-guided diffusion model with perturbation training. First, we enhance robustness by training a generator to produce perturbed samples, which are similar to normal samples and target the weaknesses of the reconstruction model. This training paradigm expands the reconstruction domain of the model, improving its generalization to unseen normal motions. Second, to address the overemphasis on motion details, we employ the 2D Discrete Cosine Transform (DCT) to separate high-frequency (local) and low-frequency (global) motion components. By guiding the diffusion model with observed high-frequency information, we prioritize the reconstruction of low-frequency components, enabling more accurate and robust anomaly detection. Extensive experiments on five widely used VAD datasets demonstrate that our approach surpasses state-of-the-art methods, underscoring its effectiveness in open-set scenarios and diverse motion contexts. Our project website is https://xiaofeng-tan.github.io/projects/FG-Diff/index.html.

Paper Structure

This paper contains 30 sections, 1 theorem, 27 equations, 7 figures, 4 tables, 2 algorithms.

Key Result

Theorem 4.1

Given an observed motion $\mathbf{x}^o$, a perturbation generator $\mathcal{G}_\phi$ trained by Eq. (eq:5), and a neighborhood parameter $\lambda$, the generated perturbed motion $\hat{\mathbf{x}}^o$ obtained by Eq. (eq:x) and Eq. (eq:4) satisfies that:

Figures (7)

  • Figure 1: The data illustration. (a) The training and testing data, where the training data is composed of seen normal motions and the testing data contains unseen normal and abnormal motions. Although seen and unseen motions represent the same action (e.g., walking), their local details, such as stride length, arm swing amplitude, and joint angles, exhibit significant differences. (b) The frequency analyses of motions. This analysis reveals that a motion retaining only 70% of its low-frequency information remains largely similar to the original motion in terms of global structure, with minor differences observed in the low-frequency regions. Note that low-frequency and high-frequency regions do not correspond directly to specific joints. Instead, low-frequency regions are defined as areas where joints predominantly contain low-frequency information while also exhibiting a relatively higher proportion of high-frequency details.
  • Figure 2: Comparison between our proposed method (green) and existing methods (blue). During the training phase, we employ adversarial training for the perturbation generator and denoiser to enhance model robustness. Specifically, the perturbation generator attacks the observed motion, producing motions that are challenging to reconstruct yet resemble normal motions. These perturbed motions are then used to train the denoiser, thereby improving its robustness. During the inference phase, we apply DCT to separate observed motion into global and local components, represented as low-frequency and high-frequency information. By leveraging high-frequency information as guidance, our method reconstructs observed motion more accurately than existing methods.
  • Figure 3: The framework of the proposed method. The model is trained utilizing generated perturbation examples. The training phase includes two processes: minimizing the mean square error to train the noise predictor $\varepsilon_\theta$ and maximizing this error to train the perturbation generator $\mathcal{G}_\phi$. During the testing phase, the high-frequency information of observed motions and the low-frequency information of generated motions are fused for effective anomaly detection.
  • Figure 4: The illustration of perturbation training. In Fig. (a), the green and yellow points denote the original training motion $x_k$ and the perturbed motion $\hat{x}_k$, respectively. The red region represents the distribution of unseen normal samples. Accordingly, Fig. (b) demonstrates that the reconstruction domain is extended by our proposed perturbation training.
  • Figure 5: The visualization of human motions processed by 2D-DCT. (a) original motions; (b) motions with low-frequency information only; (c) the comparison between (a) and (b); (d) the skeletal example. Note that the red lines in (d) denote the discarded high-frequency information, and the red circles represent the high-frequency joints with respect to the temporal and spatial dimensions.
  • ...and 2 more figures
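
The frequency fusion described in the captions of Figures 3 and 5 can be illustrated with a small, dependency-free sketch. It treats a motion as a frames-by-joint-coordinates matrix, applies an orthonormal 2D DCT, and combines the low-frequency coefficients of a generated motion with the high-frequency coefficients of the observed motion. The function names, the rectangular mask shape, and the `keep_ratio` parameter are assumptions made for illustration; the paper's actual DCT-Mask and its placement inside the denoising loop are not specified in this summary.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis of size n x n (its inverse is its transpose)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(x):
    """2D DCT of a (frames x coordinates) matrix: C_T @ x @ C_J^T."""
    ct, cj = dct_matrix(len(x)), dct_matrix(len(x[0]))
    return matmul(matmul(ct, x), transpose(cj))

def idct2(c):
    """Inverse 2D DCT: C_T^T @ c @ C_J."""
    ct, cj = dct_matrix(len(c)), dct_matrix(len(c[0]))
    return matmul(matmul(transpose(ct), c), cj)

def low_freq_mask(rows, cols, keep_ratio):
    """Hypothetical mask: 1.0 on the top-left (low-frequency) block, else 0.0."""
    return [[1.0 if (u < keep_ratio * rows and v < keep_ratio * cols) else 0.0
             for v in range(cols)] for u in range(rows)]

def fuse(observed, generated, keep_ratio):
    """Low frequencies from `generated`, high frequencies from `observed`."""
    obs_c, gen_c = dct2(observed), dct2(generated)
    mask = low_freq_mask(len(observed), len(observed[0]), keep_ratio)
    fused_c = [[m * g + (1.0 - m) * o for m, g, o in zip(mr, gr, orow)]
               for mr, gr, orow in zip(mask, gen_c, obs_c)]
    return idct2(fused_c)

# Toy example: 4 frames, 3 coordinates (values are arbitrary).
observed = [[0.0, 1.0, 2.0], [1.0, 3.0, 2.0], [4.0, 0.5, 1.0], [2.0, 2.0, 2.0]]
generated = [[0.5, 1.0, 1.5], [1.5, 2.5, 2.0], [3.0, 1.0, 1.0], [2.0, 2.5, 1.5]]
fused = fuse(observed, generated, keep_ratio=0.5)
```

With `keep_ratio=1.0` the fused motion equals the generated motion, and with `keep_ratio=0.0` it equals the observed motion; intermediate values shift how much of the global structure comes from the model.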

Theorems & Definitions (2)

  • Theorem 4.1: Effectiveness of perturbation generator
  • proof
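
The adversarial training summarized in the Figure 3 caption, where the noise predictor $\varepsilon_\theta$ minimizes the mean square error and the perturbation generator $\mathcal{G}_\phi$ maximizes it, can be written schematically as a min-max problem. The form below is only a sketch: it assumes a standard noise-prediction loss and an additive perturbation scaled by the neighborhood parameter $\lambda$, and it is not a reproduction of the paper's own equations, which this summary does not include.

$$
\min_{\theta}\ \max_{\phi}\ \mathbb{E}_{\mathbf{x}^o,\,\varepsilon,\,t}\left[\left\| \varepsilon - \varepsilon_\theta\!\left(\hat{\mathbf{x}}^o_t,\, t\right) \right\|_2^2\right],
\qquad \hat{\mathbf{x}}^o = \mathbf{x}^o + \lambda\,\mathcal{G}_\phi(\mathbf{x}^o) \quad \text{(assumed form)}
$$

Here $\hat{\mathbf{x}}^o_t$ denotes the perturbed motion after $t$ steps of forward diffusion noise $\varepsilon$. Under this reading, $\lambda$ controls how far a perturbed motion may move from the observed one, which matches the caption's description of perturbed motions that are hard to reconstruct yet still resemble normal motions.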