Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection

Jiahao Lyu, Minghua Zhao, Jing Hu, Xuewen Huang, Shuangli Du, Cheng Shi, Zhiyong Lv

TL;DR

This work tackles cross-domain video anomaly detection (VAD) by reframing anomaly handling as a deblurring task: blurred appearance frames serve as pseudo-anomalies, and a Gaussian blur-driven autoencoder learns to deblur normal content while attention suppresses blurred real anomalies. A motion-guided memory module then records and retrieves normal motion distributions to widen the normality gap, enabling zero-shot cross-dataset validation without target-domain fine-tuning. The method combines a dual-stream architecture with a motion encoder that uses zero convolutions, MRCA-based feature refinement, and an appearance-motion fusion module, optimized by a suite of losses and evaluated with a PSNR-based anomaly scoring scheme. Experiments on Ped2, Avenue, and ShanghaiTech demonstrate state-of-the-art or competitive performance, with strong cross-dataset transfer and efficient testing, since only blurred appearance images are needed as input at test time.
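The PSNR-based scoring mentioned above can be illustrated with a minimal sketch. The PSNR formula itself is standard; the per-video min-max normalization and the function names `psnr` and `anomaly_scores` are assumptions for illustration (this is a common convention in prediction-based VAD), not necessarily the exact scheme used in the paper.

```python
import math

def psnr(pred, target, max_val=1.0, eps=1e-12):
    """PSNR between two equally sized flat pixel lists with values in [0, max_val]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    mse = max(mse, eps)  # avoid log of zero when the prediction is perfect
    return 10.0 * math.log10(max_val ** 2 / mse)

def anomaly_scores(psnrs):
    """Assumed convention: min-max normalise PSNR over one video, then invert.

    A low PSNR (poor prediction) maps to a score near 1, a high PSNR to a score near 0.
    """
    lo, hi = min(psnrs), max(psnrs)
    if hi == lo:
        return [0.0 for _ in psnrs]
    return [1.0 - (p - lo) / (hi - lo) for p in psnrs]
```

Under this convention, frames that the network predicts poorly receive scores close to 1 and are the candidates for anomalies.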

Abstract

Video anomaly detection (VAD) typically learns the distribution of normal samples and detects anomalies by measuring significant deviations from it, but undesired generalization may let the model reconstruct some anomalies as well, suppressing those deviations. Meanwhile, most VAD methods cannot cope with cross-dataset validation on new target domains, and few-shot methods must laboriously tune the model on the target domain to complete domain adaptation. To address these problems, we propose a novel VAD method with a motion-guided memory module that achieves zero-shot cross-dataset validation. First, we add Gaussian blur to the raw appearance images, thereby constructing a global pseudo-anomaly that serves as the input to the network. Then, we propose multi-scale residual channel attention (MRCA) to deblur the pseudo-anomaly in normal samples. Next, memory items are obtained by recording the motion features in the training phase and are used to retrieve the motion features from the raw information in the testing phase. Lastly, our method can ignore the blurred real anomaly through attention and rely on motion memory items to increase the normality gap between normal and abnormal motion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method. Compared with cross-domain methods, our method achieves competitive performance without adaptation during testing.
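To make the first step of the abstract concrete, here is a minimal sketch of building a blurred pseudo-anomaly from a grayscale frame with a separable Gaussian filter. The values of `sigma` and `radius`, the edge-replication padding, and all function names are assumptions chosen for illustration; the excerpt does not state the blur parameters the authors use.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(img, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                idx = min(max(x + k - radius, 0), n - 1)  # clamp to the image border
                acc += w * row[idx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable blur: filter rows, then columns (sigma and radius are assumed values)."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = blur_rows(img, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))
```

In a training loop along these lines, the blurred frames would be fed to the network while the sharp originals serve as targets, so the model learns to deblur normal content.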

Paper Structure

This paper contains 21 sections, 18 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Typical memory-module solutions in video anomaly detection. Left: Most memory modules reconstruct entire appearance features, but are limited by the memory item size. Middle: A few memory modules reconstruct motion features, but are limited to RoI bounding boxes. Right: The proposed memory module avoids both limitations. Because it retrieves background-independent motion features, VAD and cross-domain detection are simple to implement.
  • Figure 2: Overview of the proposed framework. A dual-stream AE takes Gaussian-blurred appearance images $B_{1:t}$ and motion images $O_{1:t}$ as input and outputs a predicted image $\hat{I}_{t+1}$. It consists of skip connections with MRCA, a motion-guided memory module, and an appearance-motion fusion module. During testing, only blurred appearance images are input. The horizontal dimension indicates the number of output channels. H and W denote the height and width of features, respectively.
  • Figure 3: Multi-scale residual channel attention.
  • Figure 4: Motion-guided memory module. c: Cosine similarities, s: Softmax function. See text for details.
  • Figure 5: Different input images of three datasets. From top to bottom are UCSD Ped2, CUHK Avenue, and ShanghaiTech.
  • ...and 6 more figures
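The Figure 4 caption names two operations in the motion-guided memory module: cosine similarity and softmax. A minimal sketch of a memory read built from those two operations follows. Treating the output as a softmax-weighted sum of memory items is an assumption borrowed from common memory-network designs, and `read_memory` and its arguments are hypothetical names; the paper's exact read and update rules are not given in this excerpt.

```python
import math

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(query, memory):
    """Assumed read rule: weight each memory item by softmax(cosine(query, item))."""
    weights = softmax([cosine(query, item) for item in memory])
    retrieved = [sum(w * item[d] for w, item in zip(weights, memory))
                 for d in range(len(query))]
    return retrieved, weights
```

With a read of this kind, a query is always expressed as a mixture of stored normal motion items, which is one way a memory module can keep abnormal motion from being reproduced faithfully.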