2D bidirectional gated recurrent unit convolutional neural networks for end-to-end violence detection in videos
Abdarahmane Traoré, Moulay A. Akhloufi
TL;DR
The paper tackles violence detection in video by proposing an end-to-end architecture that combines a 2D CNN (based on VGG16) for spatial feature extraction with a Bidirectional GRU (BiGRU) that models temporal dynamics over the resulting frame-level features. The approach aims to deliver competitive accuracy at lower computational cost than 3D-CNN-based methods. Evaluations on the Hockey, Violent Flow, and Real Life Violence Situations datasets show strong performance (accuracy of up to 98% on Hockey and 95.5% on Violent Flow) and good generalization, while highlighting the trade-off between accuracy and computational efficiency. The work suggests that 2D CNNs with temporal modeling can be a practical alternative for real-time violence surveillance, with future directions including optical-flow fusion and lightweight backbones for near real-time deployment.
Abstract
Abnormal behavior detection, action recognition, and fight and violence detection in videos are areas that have attracted a lot of interest in recent years. In this work, we propose an architecture that combines a Bidirectional Gated Recurrent Unit (BiGRU) and a 2D Convolutional Neural Network (CNN) to detect violence in video sequences. The CNN extracts spatial characteristics from each frame, while the BiGRU extracts temporal and local motion characteristics from the CNN-extracted features of multiple frames. The proposed end-to-end deep learning network is tested on three public datasets with varying scene complexities and achieves accuracies of up to 98%. These results are promising and demonstrate the effectiveness of the proposed end-to-end approach.
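To make the temporal stage concrete, the following is a minimal NumPy sketch of a bidirectional GRU run over per-frame feature vectors, followed by a sigmoid score. It is an illustration under stated assumptions, not the authors' implementation: the CNN is taken as already applied (the input is a matrix of frame features), the weights are random rather than trained, and the 512-dimensional features, the hidden size of 32, the mean pooling over time, and the single sigmoid output are all hypothetical choices made for the example. The function names (`init_gru`, `gru_forward`, `bigru_violence_score`) are likewise invented here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(input_dim, hidden_dim, rng):
    """Random GRU parameters (illustrative initialisation, not trained weights)."""
    scale = 0.1
    return {
        "W": rng.normal(0, scale, (3, hidden_dim, input_dim)),   # input weights: z, r, n
        "U": rng.normal(0, scale, (3, hidden_dim, hidden_dim)),  # recurrent weights: z, r, n
        "b": np.zeros((3, hidden_dim)),
    }


def gru_forward(features, params):
    """Run one GRU direction over features of shape (T, input_dim).

    Returns the hidden state at every time step, shape (T, hidden_dim).
    Uses one common GRU formulation; libraries differ in small details.
    """
    W, U, b = params["W"], params["U"], params["b"]
    h = np.zeros(U.shape[1])
    states = []
    for x in features:
        z = sigmoid(W[0] @ x + U[0] @ h + b[0])        # update gate
        r = sigmoid(W[1] @ x + U[1] @ h + b[1])        # reset gate
        n = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])  # candidate state
        h = (1.0 - z) * n + z * h                      # blend candidate and previous state
        states.append(h)
    return np.stack(states)


def bigru_violence_score(features, fwd, bwd, w_out, b_out):
    """Score a clip from its per-frame features using a bidirectional GRU."""
    h_fwd = gru_forward(features, fwd)
    # The backward pass reads the frames in reverse; its outputs are flipped
    # back so that row t of both directions refers to the same frame.
    h_bwd = gru_forward(features[::-1], bwd)[::-1]
    h = np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 2 * hidden_dim)
    pooled = h.mean(axis=0)                     # assumption: mean pooling over time
    return float(sigmoid(w_out @ pooled + b_out))


# Example: 16 frames, each with a hypothetical 512-dimensional CNN feature vector.
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(16, 512))
fwd_params = init_gru(512, 32, rng)
bwd_params = init_gru(512, 32, rng)
w_out = rng.normal(0, 0.1, 64)
score = bigru_violence_score(frame_features, fwd_params, bwd_params, w_out, 0.0)
print(f"untrained violence score: {score:.3f}")
```

Because each frame passes through the 2D CNN independently, the only component that sees more than one frame is the recurrent layer. Running it in both directions lets the representation of each frame depend on what comes before and after it in the clip.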
