VADMamba++: Efficient Video Anomaly Detection via Hybrid Modeling in Grayscale Space

Jihao Lyu, Minghua Zhao, Jing Hu, Yifei Chen, Shuangli Du, Cheng Shi

Abstract

VADMamba pioneered the introduction of Mamba to Video Anomaly Detection (VAD), achieving high accuracy and fast inference through hybrid proxy tasks. Nevertheless, its heavy reliance on optical flow as an auxiliary input and on inter-task fusion scoring prevents it from operating under a single proxy task. In this paper, we introduce VADMamba++, an efficient VAD method built on a Gray-to-RGB paradigm that enforces a single-channel-to-three-channel reconstruction mapping, designed for a single proxy task and requiring no auxiliary inputs. This paradigm compels the model to infer color appearance from grayscale structure, so that anomalies are revealed more effectively through dual inconsistencies in structural and chromatic cues. Specifically, VADMamba++ reconstructs grayscale frames into the RGB space to jointly discriminate structural geometry and chromatic fidelity, thereby sharpening sensitivity to explicit visual anomalies. We further design a hybrid modeling backbone that integrates Mamba, CNN, and Transformer modules to capture diverse normal patterns while suppressing the reconstruction of anomalies. Finally, an intra-task fusion scoring strategy combines explicit future-frame prediction errors with implicit quantized feature errors, further improving accuracy in the single-task setting. Extensive experiments on three benchmark datasets demonstrate that VADMamba++ outperforms state-of-the-art methods in both accuracy and efficiency, especially under a strict single-task setting with only frame-level inputs.
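To make the scoring recipe concrete, the following minimal PyTorch sketch illustrates the Gray-to-RGB proxy task and the intra-task fusion score described above. The model interface (returning both a predicted RGB frame and a per-sample quantized-feature error) and the fusion weight alpha are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def to_grayscale(frames):
        """Collapse RGB frames (B, T, 3, H, W) to luminance (B, T, H, W)."""
        w = torch.tensor([0.299, 0.587, 0.114], device=frames.device)
        return (frames * w.view(1, 1, 3, 1, 1)).sum(dim=2)

    @torch.no_grad()
    def anomaly_score(model, clip_rgb, alpha=0.5):
        """Intra-task fusion: explicit prediction error plus implicit VQ error.

        clip_rgb: (B, T, 3, H, W). The first T-1 frames, converted to
        grayscale, form the input; the final RGB frame is the target.
        alpha is a hypothetical fusion weight.
        """
        gray_in = to_grayscale(clip_rgb[:, :-1])   # (B, T-1, H, W)
        target = clip_rgb[:, -1]                   # (B, 3, H, W)
        pred_rgb, vq_err = model(gray_in)          # assumed model interface
        pred_err = F.mse_loss(pred_rgb, target, reduction="none").mean(dim=(1, 2, 3))
        return alpha * pred_err + (1.0 - alpha) * vq_err

Here a higher score flags a frame as more anomalous; the paper's exact fusion rule and score normalization may differ from this sketch.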


Figures (8)

  • Figure 1: Comparison of AUC and FPS on Ped2 [sabokrou2015real], where single-task methods are marked with black edges. The blue dashed line shows the typical trade-off: higher accuracy often reduces speed, while our VADMamba++ notably breaks this trend.
  • Figure 2: Pipeline comparison between (a) VADMamba and (b) VADMamba++. We highlight three key evolutions: 1) Multi-input vs. Single-input; 2) Multi-task vs. Single-task; and 3) Inter-task fusion vs. Intra-task fusion.
  • Figure 3: Overview of the proposed VADMamba++, which employs an asymmetric Mamba-based encoder–decoder framework to colorize grayscale input frames into the next RGB frame. The TMC encoder combines Transformer, Mamba, and CNN blocks for spatiotemporal feature extraction, whereas the MC decoder reconstructs a colorized frame guided by the quantized representation from the VQ module (a simplified data-flow sketch follows this figure list).
  • Figure 4: The architectures of (a) BiSS2D and (b) DPA.
  • Figure 5: Examples of ground truth frames, predicted results, and error maps across five scenarios. The first four columns depict abnormal cases, whereas the last column presents a normal case. Brighter regions in the error maps indicate larger prediction errors.
  • ...and 3 more figures
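For readers without access to the figures, the sketch below mirrors the data flow described in the Figure 3 caption: stacked grayscale frames pass through an encoder, a vector-quantization (VQ) bottleneck, and a decoder that emits the colorized next RGB frame. The class names, stand-in convolutional blocks, codebook size, and strides are placeholders chosen for illustration; the actual TMC encoder and MC decoder combine Transformer, Mamba, and CNN blocks that are not reproduced here.

    import torch
    import torch.nn as nn

    class VQBottleneck(nn.Module):
        """Nearest-codeword quantization with a straight-through estimator."""
        def __init__(self, num_codes=512, dim=256):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z):
            b, c, h, w = z.shape
            flat = z.permute(0, 2, 3, 1).reshape(-1, c)             # (B*H*W, C)
            idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
            z_q = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
            vq_err = (z_q.detach() - z).pow(2).mean(dim=(1, 2, 3))  # implicit error
            return z + (z_q - z).detach(), vq_err                   # straight-through

    class GrayToRGB(nn.Module):
        """Placeholder encoder/VQ/decoder following the Figure 3 data flow."""
        def __init__(self, t_in=4, dim=256):
            super().__init__()
            self.encoder = nn.Sequential(              # stand-in for the TMC encoder
                nn.Conv2d(t_in, dim, 4, stride=4), nn.GELU(),
                nn.Conv2d(dim, dim, 3, padding=1))
            self.vq = VQBottleneck(dim=dim)
            self.decoder = nn.Sequential(              # stand-in for the MC decoder
                nn.ConvTranspose2d(dim, dim // 2, 4, stride=4), nn.GELU(),
                nn.Conv2d(dim // 2, 3, 3, padding=1))  # grayscale stack in, RGB out

        def forward(self, gray_frames):                # gray_frames: (B, T-1, H, W)
            z, vq_err = self.vq(self.encoder(gray_frames))
            return self.decoder(z), vq_err

An instance of this class satisfies the model interface assumed in the scoring sketch after the abstract: it returns both the colorized next-frame prediction and a per-sample VQ error.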