JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos

Pietro Nardelli, Danilo Comminiello

TL;DR

Violence detection in surveillance videos is hampered by diverse scenes, limited labels, and real-time constraints. JOSENet combines a lightweight two-stream Flow Gated Network with a VICReg-based self-supervised pretraining regime to learn robust multimodal video representations from unlabeled data, while using only one-fourth of the frames per video segment and a reduced frame rate to cut computational load. The approach delivers competitive accuracy and AUC on RWF-2000, with strong gains over non-regularized SSL baselines and good generalization to action recognition tasks, aided by a novel Zoom Crop augmentation and careful architectural choices. The result is a practical, scalable framework for real-time violence detection.

Abstract

The increasing proliferation of video surveillance cameras and the escalating demand for crime prevention have intensified interest in the task of violence detection within the research community. Compared to other action recognition tasks, violence detection in surveillance videos presents additional issues, such as the wide variety of real fight scenes. Unfortunately, existing datasets for violence detection are relatively small in comparison to those for other action recognition tasks. Moreover, surveillance footage often features different individuals in each video and varying backgrounds for each camera. In addition, fast detection of violent actions in real-life surveillance videos is crucial to prevent adverse outcomes, thus necessitating models that are optimized for reduced memory usage and computational costs. These challenges complicate the application of traditional action recognition methods. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model processes two spatiotemporal video streams, namely RGB frames and optical flows, and incorporates a new regularized self-supervised learning approach for videos. JOSENet demonstrates improved performance compared to state-of-the-art methods, while utilizing only one-fourth of the frames per video segment and operating at a reduced frame rate. The source code is available at https://github.com/ispamm/JOSENet.

Paper Structure

This paper contains 26 sections, 1 equation, 4 figures, 9 tables.

Figures (4)

  • Figure 1: The proposed JOSENet architecture receives as input RGB and optical flow segments as a batch of size $N$, where each segment is made of $L$ frames. An example of the RGB (top-left) and optical flow segments (bottom-left) is shown here. Both RGB (top-right) and optical flow segments (bottom-right) are augmented. In particular, a strong random cropping strategy and some other augmentation techniques are applied to RGB frames while the optical flow segments are flipped horizontally.
  • Figure 2: The proposed JOSENet framework. The primary target model (top) is a novel, efficient flow gated network (FGN) that produces a binary classification (1 if violence is detected, 0 otherwise) given optical flow and RGB segments. The FGN is pretrained with a novel two-stream SSL method (bottom) that solves an auxiliary task on unlabeled input data.
  • Figure 3: The proposed VICReg solution for the auxiliary model of the JOSENet framework. $I$ and $I'$ are, respectively, a batch of RGB segments and a batch of flow segments, which are transformed through data augmentation into two different views $X$ and $X'$. In particular, a strong random cropping strategy and some other augmentation techniques are applied to the RGB frames, while the flow frames are only flipped horizontally. The RGB branch is $f_\theta$, the optical flow branch is $f'_{\theta'}$, $m_\gamma$ is the merging block without the temporal max pooling, and $h_\phi$ is the expander, as in the original VICReg implementation. The VICReg loss function $L(Z, Z')$ is computed on the embeddings $Z$ and $Z'$.
  • Figure 4: Normalized confusion matrix (left) and ROC curve (right) obtained by evaluating our best model on the RWF-2000 validation set. The model was pretrained with our proposed VICReg method, using a batch size of 64, on the entire UCF-Crime dataset. The rows and columns of the confusion matrix represent the predicted and target labels, respectively.
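
The VICReg loss $L(Z, Z')$ mentioned in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' code: it assumes the standard VICReg formulation (an invariance term, a variance hinge term, and an off-diagonal covariance penalty), and the function name `vicreg_loss`, the coefficients 25/25/1, the target standard deviation `gamma = 1`, and `eps = 1e-4` are assumptions taken from the commonly cited VICReg defaults, not values confirmed on this page. `z` and `z_prime` stand in for the expander outputs $Z$ and $Z'$, each of shape `(N, D)`.

```python
import numpy as np


def vicreg_loss(z, z_prime, sim_coeff=25.0, std_coeff=25.0, cov_coeff=1.0,
                gamma=1.0, eps=1e-4):
    """Sketch of a VICReg-style loss on two embedding batches of shape (N, D).

    Coefficients, gamma and eps are assumed defaults, not values from the paper page.
    Returns (total, invariance, variance, covariance).
    """
    z = np.asarray(z, dtype=np.float64)
    z_prime = np.asarray(z_prime, dtype=np.float64)
    n, d = z.shape

    # Invariance: mean over the batch of the squared distance between paired embeddings.
    invariance = np.mean(np.sum((z - z_prime) ** 2, axis=1))

    def variance_term(x):
        # Hinge on the per-dimension standard deviation, pushing it towards gamma.
        std = np.sqrt(x.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(x):
        # Sum of squared off-diagonal entries of the covariance matrix, scaled by 1/D.
        xc = x - x.mean(axis=0)
        cov = xc.T @ xc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z) + variance_term(z_prime)
    covariance = covariance_term(z) + covariance_term(z_prime)
    total = sim_coeff * invariance + std_coeff * variance + cov_coeff * covariance
    return total, invariance, variance, covariance


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(64, 16))                   # stand-in for Z (batch of 64)
    z_prime = z + 0.1 * rng.normal(size=z.shape)    # stand-in for Z', a perturbed view
    total, inv, var, cov = vicreg_loss(z, z_prime)
    print(f"total={total:.4f} inv={inv:.4f} var={var:.4f} cov={cov:.4f}")
```

The variance term is what separates this objective from a plain similarity loss: if both branches collapse to a constant embedding, the invariance and covariance terms are zero, but the hinge on the standard deviation stays active and keeps the loss high.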