Streamlining Video Analysis for Efficient Violence Detection
Gourang Pathak, Abhay Kumar, Sannidhya Rawat, Shikha Gupta
TL;DR
The paper tackles automated violence detection in CCTV video by classifying scenes as "fight" or "non-fight" using X3D, an efficient 3D CNN. It introduces data augmentation and localized tube extraction with IoU-based bounding-box clustering to localize fights in cluttered surveillance footage. Training on multiple datasets (RWF-2000, Fight Detection Surveillance, staged enactments) yields an overall accuracy of about 0.86, with precision and sensitivity near 0.87. The method shows potential for real-time security systems and could be extended to related behaviours such as object violence or person collapse under diverse lighting and occlusion conditions.
Abstract
This paper addresses the challenge of automated violence detection in video captured by surveillance cameras, specifically the classification of scenes as "fight" or "non-fight." This task is critical for enhancing unmanned security systems, online content filtering, and related applications. We propose an approach built on X3D, a model based on 3D Convolutional Neural Networks (3D CNNs). Our approach incorporates pre-processing steps such as tube extraction, volume cropping, and frame aggregation, combined with clustering techniques, to accurately localize and classify fight scenes. Extensive experimentation demonstrates the effectiveness of our method in distinguishing violent from non-violent events, providing valuable insights for advancing practical violence detection systems.
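As an illustration of the IoU-based bounding-box clustering mentioned in the TL;DR, the following is a minimal sketch and not the authors' implementation. It assumes person detections arrive as (x1, y1, x2, y2) boxes, groups boxes whose pairwise IoU meets a threshold, and returns one enclosing region per group that could then be cropped as a tube. The function names, the box format, and the 0.1 threshold are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cluster_boxes(boxes: List[Box], threshold: float = 0.1) -> List[List[int]]:
    """Group box indices so that overlapping boxes share a cluster.

    Uses union-find: two boxes are linked when their IoU >= threshold
    (the threshold value is an assumption, not taken from the paper).
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def enclosing_box(boxes: List[Box], indices: List[int]) -> Box:
    """Smallest box covering every box in one cluster (the crop region)."""
    return (
        min(boxes[i][0] for i in indices),
        min(boxes[i][1] for i in indices),
        max(boxes[i][2] for i in indices),
        max(boxes[i][3] for i in indices),
    )


if __name__ == "__main__":
    detections = [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 110, 110)]
    for cluster in cluster_boxes(detections):
        print(cluster, enclosing_box(detections, cluster))
```

In this sketch, two people whose boxes overlap end up in one cluster, while an isolated person forms a cluster alone; how the paper tracks clusters across frames to form tubes is not specified in the abstract and is not modelled here.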
