A brief introduction to a framework named Multilevel Guidance-Exploration Network

Guoqing Yang, Zhiming Luo, Jianzhe Gao, Yingxin Lai, Kun Yang, Yifan He, Shaozi Li

TL;DR

The paper tackles unsupervised video anomaly detection by addressing the shortcomings of reconstruction/prediction-based methods, which can generalize too well and miss scene context. It introduces Multilevel Guidance-Exploration Network (MGENet), a two-level framework where a pre-trained Spatio-temporal Normalizing Flow guides an RGB encoder to learn motion representations, and the RGB encoder in turn guides a Mask Encoder to distill high-level appearance features, complemented by a Behavior-Scene Matching Module to capture scene-context relations. The approach employs an LSTA-based motion head and a masked visual modeling pathway, with a joint loss combining motion, appearance, and separability terms across memory modules, yielding a final anomaly score that integrates motion, appearance, and scene cues. Experiments on ShanghaiTech and UBnormal demonstrate state-of-the-art performance, underscoring the practical potential for robust, context-aware unsupervised anomaly detection in surveillance scenarios.

Abstract

Human behavior anomaly detection aims to identify unusual human actions and plays a crucial role in intelligent surveillance and other areas. Current mainstream methods still rely on reconstruction or future frame prediction. However, reconstructing or predicting low-level pixel features easily gives the network overly strong generalization ability, so anomalies can be reconstructed or predicted as well as normal data. Unlike these methods, and inspired by the Student-Teacher Network, we propose a novel framework called the Multilevel Guidance-Exploration Network (MGENet), which detects anomalies through the difference in high-level representation between the Guidance and Exploration networks. Specifically, we first use a pre-trained Normalizing Flow, which takes skeletal keypoints as input, to guide an RGB encoder, which takes unmasked RGB frames as input, to explore latent motion features. Then, the RGB encoder guides the Mask Encoder, which takes masked RGB frames as input, to explore latent appearance features. Additionally, we design a Behavior-Scene Matching Module (BSMM) to detect scene-related behavioral anomalies. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on the ShanghaiTech and UBnormal datasets.
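
The core idea in the abstract, scoring anomalies by how far the Exploration network's high-level features drift from the Guidance network's, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `discrepancy_score`, the use of one minus cosine similarity as the discrepancy measure, and the toy feature vectors are all assumptions made for illustration.

```python
import numpy as np

def discrepancy_score(guide_feat, explore_feat, eps=1e-8):
    """Hypothetical score: 1 - cosine similarity per sample.

    A higher value means the exploration features disagree more with
    the guidance features, which is treated here as "more anomalous".
    """
    g = np.asarray(guide_feat, dtype=float)
    e = np.asarray(explore_feat, dtype=float)
    num = np.sum(g * e, axis=-1)
    den = np.linalg.norm(g, axis=-1) * np.linalg.norm(e, axis=-1) + eps
    return 1.0 - num / den

# Toy features (assumed): the first pair agrees, the second is orthogonal.
guide = np.array([[1.0, 0.0], [1.0, 0.0]])
explore = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = discrepancy_score(guide, explore)
```

In this toy case the first sample scores close to 0 and the second close to 1. In a student-teacher setup of this kind, the exploration network is trained only on normal data, so a large discrepancy at test time suggests behavior it has not learned to match.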

Paper Structure (12 sections, 16 equations, 4 figures)

Figures (4)

  • Figure 1: Comparison of different methods using various features. (a) Reconstruction-based method, which uses an autoencoder to reconstruct the previous $T$ frames $f^{1:t}$. (b) Prediction-based method, which predicts frame $t+1$, $f^{t+1}$, from the prior $T$ frames. Both detect anomalies from reconstruction or prediction errors. (c) Our Multilevel Guidance-Exploration framework, which consists of two similar levels. For instance, in the first level, Encoder-B learns another type of feature under the guidance of a pre-trained network (Encoder-A), and anomalies are detected based on the similarity of the latent output features.
  • Figure 2: The overall framework of our method.
  • Figure 3: The framework of LSTA.
  • Figure 4: Calculation process of (a) the scene-related anomaly score and (b) the appearance anomaly score. Here, S represents similarity calculation, and $\textit{MASK}$ and $\overline{\textit{MASK}}$ represent mutually opposite masks. Note that in panel (b), the two sets of masked images are processed sequentially through the Mask Encoder.
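
The Figure 4 caption refers to a pair of mutually opposite masks. As a rough sketch of what such a pair could look like, the snippet below builds a patch-wise random mask and its complement and applies both to a frame. The patch size, keep ratio, random patch selection, and the helper name `complementary_masks` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def complementary_masks(height, width, patch, keep_ratio=0.5, seed=0):
    """Return a boolean pixel mask and its exact complement.

    Patches are chosen at random (an assumption); True means "visible".
    """
    rng = np.random.default_rng(seed)
    gh, gw = height // patch, width // patch
    n = gh * gw
    keep = np.zeros(n, dtype=bool)
    keep[rng.permutation(n)[: int(round(n * keep_ratio))]] = True
    grid = keep.reshape(gh, gw)
    # Expand each patch-level decision to patch x patch pixels.
    mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
    return mask, ~mask

# Toy RGB frame (assumed shape 8 x 8 x 3) and its two masked views.
frame = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
mask, mask_bar = complementary_masks(8, 8, patch=4)
view_a = frame * mask[..., None]
view_b = frame * mask_bar[..., None]
```

Because the two masks are opposite, every pixel is visible in exactly one of the two views, so processing both views in sequence lets an encoder see the whole frame without ever seeing it unmasked.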