VADMamba: Exploring State Space Models for Fast Video Anomaly Detection

Jiahao Lyu, Minghua Zhao, Jing Hu, Xuewen Huang, Yifei Chen, Shuangli Du

TL;DR

VADMamba addresses the speed-accuracy trade-off in video anomaly detection (VAD) by using Mamba state space models for long-range modeling. It introduces VQ-MaU, a vector-quantized Mamba Unet built from Non-negative Visual State Space (NVSS) blocks for fast feature aggregation, and trains two such networks on separate tasks: frame prediction and optical flow reconstruction. A clip-level fusion strategy combines the appearance and motion cues from the two networks to improve anomaly discrimination. On Ped2, Avenue, and ShanghaiTech, VADMamba runs faster at inference than previous methods while achieving competitive or superior detection performance, which suggests that Mamba-based VAD is practical for real-time surveillance.

Abstract

Video anomaly detection (VAD) methods are mostly CNN-based or Transformer-based and achieve impressive results, but their focus on detection accuracy often comes at the expense of inference speed. The emergence of state space models in computer vision, exemplified by the Mamba model, demonstrates improved computational efficiency through selective scans and shows strong potential for long-range modeling. Our study pioneers the application of Mamba to VAD with VADMamba, which is based on multi-task learning for frame prediction and optical flow reconstruction. Specifically, we propose the VQ-Mamba Unet (VQ-MaU) framework, which incorporates a Vector Quantization (VQ) layer and a Mamba-based Non-negative Visual State Space (NVSS) block. Furthermore, two individual VQ-MaU networks separately predict frames and reconstruct the corresponding optical flows, and a clip-level fusion evaluation strategy further boosts accuracy. Experimental results on three benchmark datasets validate the efficacy of the proposed VADMamba and demonstrate superior inference speed compared to previous work. Code is available at https://github.com/jLooo/VADMamba.
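
The abstract names a Vector Quantization (VQ) layer but does not spell out how it works. As a rough illustration only, the sketch below shows the generic nearest-codebook lookup that VQ layers are usually built on: each feature vector is replaced by the closest entry in a learned codebook. The function name, array shapes, and toy values are assumptions made for this example and are not taken from the VADMamba code; the actual layer may differ (for instance in its distance measure, its training losses, or where it sits in the network).

```python
import numpy as np


def vector_quantize(features, codebook):
    """Replace each feature vector with its nearest codebook entry.

    features: (N, D) array of feature vectors (hypothetical shape).
    codebook: (K, D) array of code vectors (hypothetical; learned in practice).
    Returns the quantized features (N, D) and the chosen indices (N,).
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    distances = (
        np.sum(features ** 2, axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + np.sum(codebook ** 2, axis=1)
    )
    indices = np.argmin(distances, axis=1)  # nearest code per feature
    quantized = codebook[indices]           # look up the selected codes
    return quantized, indices


# Toy example: three 2-D features, a codebook with three entries.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
features = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
quantized, indices = vector_quantize(features, codebook)
print(indices)  # index of the nearest code for each feature
```

Because the quantized output can only take values from the codebook, a model trained on normal data has a limited vocabulary of features to decode from, which is one common motivation for placing such a layer in a prediction or reconstruction network.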

Paper Structure

This paper contains 8 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of inference speed (FPS) and frame-level AUC (%) on Ped2. VADMamba demonstrates state-of-the-art performance in terms of FPS.
  • Figure 2: Overview of the proposed VADMamba. (a) The training and inference process of VADMamba. (b) The framework of the proposed VQ-MaU. (c) Non-negative Vision State Space block. The dashed line indicates that addition is used in the second loop. (d) Vision State-Space (VSS) with SS2D.
  • Figure 3: Visualization examples of frame prediction (FP) and optical flow reconstruction (FR). From top to bottom, we show ground truth (GT), predicted frames (P), reconstructed optical flows (R), and error maps (Error). In the error maps, brighter colors indicate larger errors. Objects marked with red borders are anomalous events; objects marked with green borders are normal events.
  • Figure 4: Anomaly score curves for six examples. Red regions indicate anomalous events; higher scores indicate a greater likelihood of an anomaly.