CUE-Net: Violence Detection Video Analytics with Spatial Cropping, Enhanced UniformerV2 and Modified Efficient Additive Attention

Damith Chamalke Senadeera, Xiaoyun Yang, Dimitrios Kollias, Gregory Slabaugh

TL;DR

Violence detection in surveillance video is hard because violent actions can be distant, occluded, or context-dependent. The authors propose CUE-Net, which combines spatial Cropping with an enhanced UniformerV2 architecture and a Modified Efficient Additive Attention (MEAA) mechanism to capture local and global spatio-temporal cues. The local path uses LT_MHRA and GS_MHRA components, the global path replaces standard self-attention with MEAA, and a Fusion Block combines the outputs of the two paths. Evaluated on RWF-2000 and RLVS, CUE-Net achieves state-of-the-art accuracies of 94.00% and 99.50%, respectively, and ablations support the contribution of both spatial cropping and MEAA. This approach offers a scalable and efficient violence-detection solution for real-world surveillance.

Abstract

In this paper we introduce CUE-Net, a novel architecture designed for automated violence detection in video surveillance. As surveillance systems become more prevalent due to technological advances and decreasing costs, the challenge of efficiently monitoring vast amounts of video data has intensified. CUE-Net addresses this challenge by combining spatial Cropping with an enhanced version of the UniformerV2 architecture, integrating convolutional and self-attention mechanisms alongside a novel Modified Efficient Additive Attention mechanism (which reduces the quadratic time complexity of self-attention) to effectively and efficiently identify violent activities. This approach aims to overcome traditional challenges such as capturing distant or partially obscured subjects within video frames. By focusing on both local and global spatiotemporal features, CUE-Net achieves state-of-the-art performance on the RWF-2000 and RLVS datasets, surpassing existing methods.
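
The spatial Cropping step can be illustrated with a minimal sketch. It assumes that per-frame person boxes are already available from a detector (Figure 2 names YOLO V8 for this role) and that a clip is cropped to the single region covering every detected person. The function names, the `(x1, y1, x2, y2)` pixel box format, and the `(T, H, W, C)` clip layout are assumptions made for illustration, and the paper's own cropping algorithm may differ in its details.

```python
import numpy as np

def union_box(boxes_per_frame, frame_h, frame_w):
    """Return the smallest (x1, y1, x2, y2) box covering every person box in the clip.

    boxes_per_frame is a list with one entry per frame; each entry is a list of
    (x1, y1, x2, y2) boxes in pixels (hypothetical format, assumed here).
    """
    boxes = [box for frame in boxes_per_frame for box in frame]
    if not boxes:
        # No people detected anywhere: fall back to the full frame.
        return 0, 0, frame_w, frame_h
    arr = np.asarray(boxes, dtype=float)
    x1 = int(np.floor(arr[:, 0].min()))
    y1 = int(np.floor(arr[:, 1].min()))
    x2 = int(np.ceil(arr[:, 2].max()))
    y2 = int(np.ceil(arr[:, 3].max()))
    # Clamp to the frame boundaries.
    x1, y1 = max(x1, 0), max(y1, 0)
    x2, y2 = min(x2, frame_w), min(y2, frame_h)
    return x1, y1, x2, y2

def crop_clip(frames, boxes_per_frame):
    """Crop every frame of a (T, H, W, C) clip to the same union box."""
    _, h, w, _ = frames.shape
    x1, y1, x2, y2 = union_box(boxes_per_frame, h, w)
    return frames[:, y1:y2, x1:x2, :]

# Usage: a 4-frame clip where people occupy only part of each frame.
clip = np.zeros((4, 100, 200, 3), dtype=np.uint8)
detections = [[(50, 20, 80, 60)], [(60, 30, 120, 70)], [], [(55, 25, 90, 65)]]
cropped = crop_clip(clip, detections)
print(cropped.shape)
```

Using one box for the whole clip, rather than a separate box per frame, keeps the crop spatially consistent over time, so the downstream spatio-temporal layers see a stable field of view.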

Paper Structure (22 sections, 7 equations, 3 figures, 4 tables, 1 algorithm)

Figures (3)

  • Figure 1: Sample violence detection videos. (a) is a set of frames from a challenging video from RWF-2000 where the people involved in the violent incident are far away from the camera, occupying only a small part of the frame. (b) shows a typical violent video from the RWF-2000 dataset correctly classified by CUE-Net. (c) is a video from the RWF-2000 dataset test split, where a man makes punching actions but is not really engaging in a fight. CUE-Net incorrectly classifies this as a violent video. (d) is a video from the RLVS dataset which CUE-Net correctly classifies as non-violent, but for which the ground-truth is mislabeled as violent.
  • Figure 2: The overall CUE-Net architecture with its main components. (a) the Spatial Cropping Module uses the YOLO V8 algorithm to detect people and crop the video spatially; (b) the 3D Convolutional Block which is used to encode and downsample the frames spatio-temporally; (c) the Local UniBlock V2 which is mainly used to capture the important local dependencies with its main components LT_MHRA, GS_MHRA and a feed forward network (FFN); (d) the Global UniBlock V3 which is mainly used to capture the important global spatio-temporal dependencies, with its main components Dynamic Positional Embedding (DPE) unit, MEAA unit which implements a novel efficient self-attention mechanism and a feed forward network (FFN); (e) the Fusion Block which is used to fuse the outputs of the Local UniBlock V2 and Global UniBlock V3.
  • Figure 3: (a) illustrates the Efficient Additive Attention where the expensive matrix multiplication operations have been replaced with element-wise multiplications and linear transformations via a query-key pair interaction. (b) represents the Modified Efficient Additive Attention (MEAA) which only uses a query vector instead of a whole query matrix when computing Efficient Additive Attention, reducing the computational complexity along with memory usage.
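
The idea in Figure 3(a) can be sketched in a few lines of NumPy. This is a minimal sketch of generic efficient additive attention as the caption describes it (a pooled query interacting with the keys through element-wise multiplication and linear transformations), not the authors' MEAA, whose exact form the captions do not specify. The weight names `w_q`, `w_k`, `w_a`, `w_t`, the softmax normalisation, and the residual query term are assumptions made for illustration.

```python
import numpy as np

def efficient_additive_attention(x, w_q, w_k, w_a, w_t):
    """Additive attention over n tokens of width d, with cost linear in n.

    x: (n, d) token matrix; w_q, w_k, w_t: (d, d) weights; w_a: (d,) weight vector.
    No (n, n) attention matrix is ever formed.
    """
    q = x @ w_q                          # (n, d) queries
    k = x @ w_k                          # (n, d) keys
    d = q.shape[1]
    scores = (q @ w_a) / np.sqrt(d)      # (n,) one scalar score per token
    scores = scores - scores.max()       # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax (assumed normalisation)
    global_q = alpha @ q                 # (d,) single pooled query vector
    interaction = k * global_q           # (n, d) element-wise query-key interaction
    return interaction @ w_t + q         # linear transformation plus residual query

# Usage with random weights (shapes only; the values carry no meaning).
rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
w_q, w_k, w_t = (rng.normal(size=(d, d)) for _ in range(3))
w_a = rng.normal(size=d)
out = efficient_additive_attention(x, w_q, w_k, w_a, w_t)
print(out.shape)
```

Standard self-attention builds an n-by-n score matrix, so its cost grows quadratically with the number of tokens. Here every intermediate is at most n-by-d, which is where the linear scaling comes from.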