Enhance Multi-Scale Spatial-Temporal Coherence for Configurable Video Anomaly Detection

Kai Cheng, Xinzhe Li, Lijuan Che

TL;DR

The work tackles the challenge of varying detection demands in unsupervised video anomaly detection by introducing Degree of Tolerance (DoT) and a configurable two-tier CVAD architecture. It couples a stack-and-block design with Multi-Scale Memory with Selective Mechanism (MS$^2$M) to model spatial-temporal coherence across multiple scales, enabling rapid adaptation to new DoTs by freezing existing blocks and adding new ones. The MS$^2$M module employs depth-wise convolutions for receptive-field growth and attention-based memory reading/writing to memorize and refine normal patterns, optimized by a joint reconstruction and DoT-aware loss. Experiments on three standard benchmarks and a DoT-extended Ped2 dataset demonstrate state-of-the-art performance and substantial training-time savings, highlighting CVAD's practical impact for configurable, resource-efficient VAD in dynamic environments.
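The attention-based memory reading mentioned above can be illustrated with a minimal sketch. It assumes dot-product similarity followed by a softmax over memory items, a common choice for memory-augmented models; the summary does not state the paper's actual similarity measure, scaling, or write rule, and the names `memory_read`, `query`, and `memory` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query, memory):
    """Read from memory with attention (assumed dot-product similarity).

    query:  feature vector of length d
    memory: list of N memory items, each of length d
    Returns the attention-weighted sum of memory items and the weights.
    """
    # Similarity between the query and every stored normal pattern.
    scores = [sum(q * m for q, m in zip(query, item)) for item in memory]
    weights = softmax(scores)
    # Weighted combination of memory items, one output entry per dimension.
    read = [
        sum(w * item[k] for w, item in zip(weights, memory))
        for k in range(len(query))
    ]
    return read, weights


# Two toy memory items; the query is aligned with the first one.
memory = [[1.0, 0.0], [0.0, 1.0]]
read, weights = memory_read([2.0, 0.0], memory)
```

In this toy case the first memory item receives the larger weight, so the read vector is pulled toward it. A query that matches no stored normal pattern well would be reconstructed poorly from memory, which is the usual intuition behind memory-based anomaly scoring.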

Abstract

Unsupervised Video Anomaly Detection (VAD) builds on techniques from the field of signal processing. Because the notion of an anomaly is ambiguous and unbounded, detection demands can differ even within a single scenario. Previous methods must retrain their models from scratch, wasting resources, whenever the detection demand changes even slightly. To address this, we propose a configurable VAD with flexible solutions. We also design a compatible dataset for evaluating VAD performance when detection demands change. In addition, videos carry important information about continuous changes in an object's appearance and motion, so we propose a module that establishes multi-scale spatial-temporal coherence; it improves accuracy and can dynamically adjust to, and accurately capture, normal spatial-temporal patterns. Experiments show that our method models coherence effectively and offers better configurability.

Paper Structure (10 sections, 9 equations, 5 figures, 2 tables)


Figures (5)

  • Figure 1: The overall architecture of configurable video anomaly detection with flexible solutions. (a) and (b) represent the stack and block level, respectively.
  • Figure 2: The architecture of the multi-scale memory with selective mechanism.
  • Figure 3: Illustration of the difference between the predicted frames and the ground truth.
  • Figure 4: Illustration of temporal localization results of the detection.
  • Figure 5: The qualitative analysis of hyperparameter $\tau$.