STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection

Zhangxun Li, Mengyang Zhao, Xuan Yang, Yang Liu, Jiamu Sheng, Xinhua Zeng, Tian Wang, Kewei Wu, Yu-Gang Jiang

TL;DR

STNMamba is a Mamba-based framework for unsupervised video anomaly detection that jointly learns spatial and temporal normality with a lightweight architecture. It uses a dual-encoder design: a spatial encoder built from Multi-Scale Vision Space State Blocks (MS-VSSB) for appearance and a temporal encoder built from Channel-Aware Vision Space State Blocks (CA-VSSB) for motion. A Spatial-Temporal Interaction Module (STIM) connects the two encoders at multiple levels; within it, Spatial-Temporal Fusion Blocks (STFB) fuse the two feature streams and a memory bank stores prototypes of normal patterns, which limits how well the model can represent anomalies. A Spatial-Temporal United Decoder then predicts future frames, and the quality of the prediction is the basis for anomaly scoring. On UCSD Ped2, CUHK Avenue, and ShanghaiTech, STNMamba reaches competitive frame-level AUC with fewer parameters and lower FLOPs than existing methods, which suggests it is practical for efficiency-sensitive surveillance settings.
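
To make the link between future-frame prediction and frame-level scoring concrete, here is a minimal sketch of a convention that is common in prediction-based VAD: compute the PSNR between the predicted and the actual frame, then min-max normalise it per video. The summary above does not state STNMamba's exact scoring formula, so the function names `psnr` and `normality_scores`, the flat-list frame representation, and the clamping constant are all assumptions made for illustration.

```python
import math


def psnr(pred, target, max_val=1.0):
    """PSNR (dB) between two frames given as flat lists of floats in [0, max_val].

    Hypothetical helper: the MSE is clamped to a small positive value so that
    a perfect prediction yields a large finite PSNR instead of infinity.
    """
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    mse = max(mse, 1e-10)
    return 10.0 * math.log10(max_val ** 2 / mse)


def normality_scores(psnrs):
    """Min-max normalise the PSNR values of one video to [0, 1].

    Higher means the frame was predicted well, i.e. it looks normal.
    An anomaly score can be taken as 1 - normality score.
    """
    lo, hi = min(psnrs), max(psnrs)
    if hi == lo:
        # All frames were predicted equally well; treat them all as normal.
        return [1.0 for _ in psnrs]
    return [(p - lo) / (hi - lo) for p in psnrs]


# Toy example: one ground-truth frame and three predictions of decreasing quality.
target = [0.5, 0.5, 0.5, 0.5]
predictions = [
    [0.5, 0.5, 0.5, 0.5],  # perfect prediction
    [0.6, 0.6, 0.6, 0.6],  # small error
    [0.9, 0.9, 0.9, 0.9],  # large error, as an anomalous frame might produce
]
scores = normality_scores([psnr(p, target) for p in predictions])
```

In this toy run the poorly predicted frame receives the lowest normality score. Frame-level AUC is then computed by comparing such per-frame scores against ground-truth anomaly labels.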

Abstract

Video anomaly detection (VAD) has been extensively researched due to its potential for intelligent video systems. However, most existing methods based on CNNs and transformers still suffer from substantial computational burdens and have room for improvement in learning spatial-temporal normality. Recently, Mamba has shown great potential for modeling long-range dependencies with linear complexity, providing an effective solution to the above dilemma. To this end, we propose a lightweight and effective Mamba-based network named STNMamba, which incorporates carefully designed Mamba modules to enhance the learning of spatial-temporal normality. Firstly, we develop a dual-encoder architecture, where the spatial encoder equipped with Multi-Scale Vision Space State Blocks (MS-VSSB) extracts multi-scale appearance features, and the temporal encoder employs Channel-Aware Vision Space State Blocks (CA-VSSB) to capture significant motion patterns. Secondly, a Spatial-Temporal Interaction Module (STIM) is introduced to integrate spatial and temporal information across multiple levels, enabling effective modeling of intrinsic spatial-temporal consistency. Within this module, the Spatial-Temporal Fusion Block (STFB) is proposed to fuse the spatial and temporal features into a unified feature space, and the memory bank is utilized to store spatial-temporal prototypes of normal patterns, restricting the model's ability to represent anomalies. Extensive experiments on three benchmark datasets demonstrate that our STNMamba achieves competitive performance with fewer parameters and lower computational costs than existing methods.
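
The claim that a memory bank of normal prototypes restricts the model's ability to represent anomalies can be illustrated with a generic memory read. The sketch below is not the authors' implementation: it assumes cosine-similarity addressing with a softmax, and the names `cosine`, `memory_read`, and `temperature` are hypothetical. The key property is that the read-out is a convex combination of stored prototypes, so a query far from every prototype cannot be reproduced faithfully.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def memory_read(query, memory, temperature=1.0):
    """Read from a memory bank of prototypes (assumed addressing scheme).

    Returns the read-out vector and the addressing weights. The weights are a
    softmax over cosine similarities, so they are non-negative and sum to 1.
    """
    sims = [cosine(query, item) / temperature for item in memory]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    read = [sum(w * item[d] for w, item in zip(weights, memory)) for d in range(dim)]
    return read, weights


# Two toy prototypes standing in for learned normal patterns.
memory = [[1.0, 0.0], [0.0, 1.0]]

# A query that matches the first prototype is retrieved almost unchanged.
normal_read, normal_weights = memory_read([1.0, 0.0], memory, temperature=0.1)

# A query equally similar to both prototypes is pulled to their average.
mixed_read, mixed_weights = memory_read([3.0, 3.0], memory, temperature=0.1)
```

Here the second query `[3.0, 3.0]` comes back as roughly `[0.5, 0.5]`: the read-out stays inside the span of the stored prototypes regardless of the query's magnitude, and that gap between the query and what the memory returns is what a downstream decoder would turn into a larger prediction error.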

Paper Structure

This paper contains 33 sections, 19 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Performance comparisons with respect to FLOPs and Params on the UCSD Ped2 dataset. The size of the circles represents the model's FLOPs or parameters. Our STNMamba outperforms these methods while maintaining notably low model parameters and computational complexity.
  • Figure 2: Illustration of the main architectures in unsupervised VAD methods. Unlike existing (a) single-stream networks, (b) dual-stream networks that learn spatial and temporal patterns independently and detect anomalies from the two dimensions, and (c) two-stream networks that perform fusion at the bottleneck, (d) our proposed STNMamba integrates spatial and temporal information seamlessly at multiple levels to model inherent spatial-temporal consistency.
  • Figure 3: Overview of the proposed STNMamba in (a). The structures of Vision Space State Block, Multi-Scale VSS Block, and Channel-Aware VSS Block are illustrated in (b), (c), and (d), respectively. The proposed STNMamba contains a spatial encoder $\mathcal{E}_{s}$ for appearance encoding, a temporal encoder $\mathcal{E}_{t}$ for motion encoding, a Spatial-Temporal Interaction Module (STIM) for spatial-temporal consistency modeling, and a decoder $\mathcal{D}_{st}$ for decoding and predicting.
  • Figure 4: Structure of the proposed Spatial-Temporal Fusion Block (STFB).
  • Figure 5: Results of sensitivity analysis to hyperparameters $\tau$ (left) and $k$ (right) on the UCSD Ped2 and CUHK Avenue datasets.
  • ...and 2 more figures