2D Bidirectional Gated Recurrent Unit Convolutional Neural Networks for End-to-End Violence Detection in Videos

Abdarahmane Traoré, Moulay A. Akhloufi

TL;DR

The paper tackles violence detection in video by proposing an end-to-end architecture that combines a 2D CNN (based on VGG16) for spatial feature extraction with a Bidirectional GRU to model temporal dynamics. The approach aims to deliver competitive accuracy with lower computational cost than 3D-CNN-based methods, leveraging frame-level features passed through a BiGRU to capture temporal context. Evaluations on Hockey, Violent Flow, and Real Life Violence Situations datasets show strong performance (up to 98% on Hockey and 95.5% on Violent Flow) and good generalization, while highlighting the trade-off between accuracy and computational efficiency. The work suggests that 2D CNNs with temporal modeling can be a practical alternative for real-time violence surveillance, with future directions including optical-flow fusion and lightweight backbones for near real-time deployment.

Abstract

Abnormal behavior detection, action recognition, and fight and violence detection in videos are areas that have attracted considerable interest in recent years. In this work, we propose an architecture that combines a Bidirectional Gated Recurrent Unit (BiGRU) and a 2D Convolutional Neural Network (CNN) to detect violence in video sequences. The CNN extracts spatial features from each frame, while the BiGRU extracts temporal and local motion features from the CNN features of multiple frames. The proposed end-to-end deep learning network is tested on three public datasets with varying scene complexities and achieves accuracies of up to 98%. These results are promising and demonstrate the performance of the proposed end-to-end approach.
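
To make the data flow concrete, the temporal stage can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the per-frame feature vectors are random stand-ins for VGG16 outputs, the weights are untrained, biases are omitted, and the choices to concatenate the final forward and backward hidden states and to score them with a single sigmoid unit are assumptions. The names `GRUCell`, `run_gru`, and `bigru_violence_score` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU cell (biases omitted to keep the sketch short)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One input matrix W and one recurrent matrix U per gate:
        # z = update gate, r = reset gate, n = candidate state.
        self.W = {g: rand_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_size, hidden_size, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in
             zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        # New state interpolates between the old state and the candidate.
        return [zi * hi + (1.0 - zi) * ni for zi, hi, ni in zip(z, h, n)]


def run_gru(cell, frames, reverse=False):
    """Run the cell over the frame sequence and return the final hidden state."""
    h = [0.0] * cell.hidden_size
    for x in (reversed(frames) if reverse else frames):
        h = cell.step(x, h)
    return h


def bigru_violence_score(frame_features, fwd, bwd, w_out):
    """Concatenate forward and backward summaries, then apply a sigmoid unit."""
    clip_vec = run_gru(fwd, frame_features) + run_gru(bwd, frame_features, reverse=True)
    return sigmoid(sum(w * v for w, v in zip(w_out, clip_vec)))


rng = random.Random(0)
FEATURE_DIM, HIDDEN, NUM_FRAMES = 8, 4, 10  # toy sizes, not the paper's

# Stand-in for CNN output: one feature vector per video frame.
frame_features = [[rng.random() for _ in range(FEATURE_DIM)]
                  for _ in range(NUM_FRAMES)]
fwd = GRUCell(FEATURE_DIM, HIDDEN, rng)
bwd = GRUCell(FEATURE_DIM, HIDDEN, rng)
w_out = [rng.uniform(-0.1, 0.1) for _ in range(2 * HIDDEN)]

score = bigru_violence_score(frame_features, fwd, bwd, w_out)
print(f"violence probability (untrained): {score:.3f}")
```

Because the backward pass reads the frames in reverse order, each direction summarizes the clip with a different temporal context, which is what lets the bidirectional variant use both past and future frames when scoring a sequence.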


Paper Structure

This paper contains 12 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: VGG16 used to capture spatial features
  • Figure 2: BiGRU used to capture temporal features
  • Figure 3: 2D BiGRU-CNN architecture
  • Figure 4: Frames from Hockey Dataset
  • Figure 5: Frames from ViolentFlow
  • ...and 1 more figure