Dual Memory Units with Uncertainty Regulation for Weakly Supervised Video Anomaly Detection

Hang Zhou, Junqing Yu, Wei Yang

TL;DR

This work tackles weakly supervised video anomaly detection by modeling both normal and abnormal patterns. It introduces the UR-DMU framework, which combines Global and Local Multi-Head Self Attention, dual memory banks for normal and abnormal prototypes, and a Normal Data Uncertainty Learning module that imposes a Gaussian latent space on normal data. The method uses a multi-term loss to jointly learn discriminative abnormal features and robust normal-space representations, and it reports state-of-the-art results on UCF-Crime and XD-Violence, supported by ablation studies. The approach improves robustness to noise and reduces false alarms in weakly labeled video settings, which is relevant to surveillance analytics and safety systems.

Abstract

Learning discriminative features that effectively separate abnormal events from normality is crucial for weakly supervised video anomaly detection (WS-VAD) tasks. Existing approaches, whether oriented toward video-level or segment-level labels, mainly focus on extracting representations for anomaly data while neglecting the implications of normal data. We observe that such a scheme is sub-optimal: to better distinguish anomalies, one needs to understand what a normal state is, and neglecting this may yield a higher false alarm rate. To address this issue, we propose an Uncertainty Regulated Dual Memory Units (UR-DMU) model to learn both the representations of normal data and discriminative features of abnormal data. Specifically, inspired by the traditional global and local structure of graph convolutional networks, we introduce a Global and Local Multi-Head Self Attention (GL-MHSA) module for the Transformer network to obtain more expressive embeddings for capturing associations in videos. Then, we use two memory banks, with the additional abnormal memory for tackling hard samples, to store and separate abnormal and normal prototypes and to maximize the margins between the two representations. Finally, we propose an uncertainty learning scheme to learn a latent space for normal data that is robust to noise from camera switching, object changes, scene transformations, etc. Extensive experiments on the XD-Violence and UCF-Crime datasets demonstrate that our method outperforms state-of-the-art methods by a sizable margin.
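
The abstract does not spell out how the normal latent space is made Gaussian. As a minimal sketch, one common way to do this is to predict a mean and log-variance per feature, sample with the reparameterization trick, and penalize the KL divergence to a standard normal. The function names, the standard-normal prior, and the use of log-variance below are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1).

    Assumption: the encoder outputs a mean and a log-variance per dimension.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)          # fixed seed so the sketch is repeatable
mu = [0.0, 0.0, 0.0]            # hypothetical predicted mean
log_var = [0.0, 0.0, 0.0]       # hypothetical predicted log-variance (sigma = 1)

z = sample_latent(mu, log_var, rng)
kl_zero = kl_to_standard_normal(mu, log_var)                  # prior matched
kl_shifted = kl_to_standard_normal([1.0, 0.0, 0.0], log_var)  # mean moved off 0
```

The KL term is zero when the predicted distribution already matches the prior and grows as the mean or variance drifts. A penalty of this kind keeps normal features in a compact region while the sampling noise exposes the downstream classifier to small perturbations.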

Paper Structure (17 sections, 11 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: We use two memory banks to store normal and anomaly instances respectively; each memory bank attracts nearby instances. The normal space is regulated to be a Gaussian distribution, while no additional regulation is applied to the anomaly feature space. Such a strategy can tackle hard samples better. The bottom curve shows prediction results for part of the video "Black.Hawk.Down".
  • Figure 2: The framework of our UR-DMU model consists of three parts: global and local feature learning (GL-MHSA), dual memory units (DMU), and normal data uncertainty learning (NUL). GL-MHSA extracts more expressive embeddings, DMU stores both normal and abnormal patterns for discrimination, and NUL constrains the normal data to a Gaussian distribution to handle uncertainty perturbation.
  • Figure 3: The dual memory units store normal and anomaly instances in different memories. (a) shows a normal video feature passing through the dual memory units to obtain the relevant output scores and augmented features. (b) shows the same process for an abnormal video feature.
  • Figure 4: Qualitative results of anomaly detection performance on UCF-Crime and XD-Violence. The visualizations in (a) to (d) are from XD-Violence, and those in (e) to (h) are from UCF-Crime.
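
The Figure 3 caption describes a feature passing through both memories and producing output scores and augmented features. The sketch below shows one attention-style way such a read could work. The sigmoid similarity, the scaling by the square root of the feature dimension, the top-k averaging, and all names are assumptions for illustration; the captions do not give the actual formulation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def memory_read(query, memory):
    """Read one memory bank with a single query vector.

    Returns (scores, augmented): one similarity score per memory slot, and a
    score-weighted sum of the slots as the augmented feature.
    """
    scale = math.sqrt(len(query))
    scores = [sigmoid(sum(q * m for q, m in zip(query, slot)) / scale)
              for slot in memory]
    augmented = [sum(s * slot[d] for s, slot in zip(scores, memory))
                 for d in range(len(query))]
    return scores, augmented


def bank_score(scores, k):
    """Mean of the k largest slot scores (assumed aggregation rule)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


# Toy prototypes: hypothetical 2-D memories, two slots each.
normal_memory = [[1.0, 0.0], [0.8, 0.2]]
abnormal_memory = [[-1.0, 0.0], [-0.8, -0.2]]

query = [2.0, 0.0]  # a feature lying close to the normal prototypes
normal_scores, normal_aug = memory_read(query, normal_memory)
abnormal_scores, abnormal_aug = memory_read(query, abnormal_memory)

normal_match = bank_score(normal_scores, k=1)
abnormal_match = bank_score(abnormal_scores, k=1)
```

In this toy setup the query matches the normal bank more strongly than the abnormal bank. That gap is the kind of separation a margin between the two memories is meant to enlarge.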