Streamlining Video Analysis for Efficient Violence Detection

Gourang Pathak, Abhay Kumar, Sannidhya Rawat, Shikha Gupta

TL;DR

The paper tackles automated violence detection in CCTV video by classifying scenes as fight vs non-fight using the efficient 3D CNN X3D. It introduces data augmentation and localized tube extraction with IoU-based bounding-box clustering to robustly localize fights in cluttered surveillance footage. Training on multiple datasets (RWF-2000, Fight Detection Surveillance, staged enactments) yields an overall accuracy of about 0.86 with precision and sensitivity near 0.87. The method demonstrates strong potential for real-time security systems and can be extended to related behaviours such as object violence or person collapse in diverse lighting and occlusion conditions.
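The IoU-based bounding-box clustering mentioned above can be illustrated with a minimal sketch. The summary does not specify the authors' exact procedure, so everything here is an assumption for illustration: the box format `(x1, y1, x2, y2)`, the helper names `iou` and `cluster_boxes`, the threshold value `0.1`, and the choice of grouping boxes transitively with union-find.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cluster_boxes(boxes: List[Box], threshold: float = 0.1) -> List[List[int]]:
    """Group box indices so that overlapping boxes end up in one cluster.

    Two boxes are linked when their IoU is at least `threshold` (a
    hypothetical value); links are transitive, tracked with union-find.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two overlapping person boxes and one distant box.
people = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)]
clusters = cluster_boxes(people)
```

In this sketch, a cluster with two or more overlapping person boxes would be a candidate region for a fight, while isolated boxes could be ignored; that interpretation is also an assumption rather than something the summary states.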

Abstract

This paper addresses the challenge of automated violence detection in video frames captured by surveillance cameras, specifically focusing on classifying scenes as "fight" or "non-fight." This task is critical for enhancing unmanned security systems, online content filtering, and related applications. We propose an approach using a 3D Convolutional Neural Network (3D CNN)-based model named X3D to tackle this problem. Our approach incorporates pre-processing steps such as tube extraction, volume cropping, and frame aggregation, combined with clustering techniques, to accurately localize and classify fight scenes. Extensive experimentation demonstrates the effectiveness of our method in distinguishing violent from non-violent events, providing valuable insights for advancing practical violence detection systems.
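As a rough illustration of the pre-processing steps named above (tube extraction, volume cropping, frame aggregation), the sketch below computes one fixed crop region for a cluster of boxes across a clip and picks a fixed number of evenly spaced frames. It is not the authors' implementation: the helper names `union_box`, `sample_indices`, and `extract_tube`, the default of 16 sampled frames, and the use of a single union crop for the whole clip are all assumptions.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def union_box(boxes: List[Box]) -> Box:
    """Smallest box that encloses every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def sample_indices(num_available: int, num_samples: int) -> List[int]:
    """Evenly spaced frame indices; indices repeat if the clip is short."""
    if num_available <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [0]
    step = (num_available - 1) / (num_samples - 1)
    return [min(num_available - 1, int(i * step + 0.5)) for i in range(num_samples)]


def extract_tube(
    boxes_per_frame: List[List[Box]], num_samples: int = 16
) -> Optional[Tuple[Box, List[int]]]:
    """Return (crop region, sampled frame indices) for one box cluster.

    `boxes_per_frame[t]` holds the cluster's boxes detected in frame t and
    may be empty. The crop region is the union over the whole clip, so the
    cropped volume has a constant spatial size.
    """
    all_boxes = [b for frame in boxes_per_frame for b in frame]
    if not all_boxes:
        return None
    return union_box(all_boxes), sample_indices(len(boxes_per_frame), num_samples)


# Four frames of detections for one cluster; frame 2 has no detection.
clip_boxes = [[(10, 10, 20, 20)], [(12, 8, 25, 22)], [], [(15, 15, 30, 30)]]
tube = extract_tube(clip_boxes, num_samples=2)
```

The returned crop region and frame indices would then be used to cut a fixed-size volume out of the video before passing it to the classifier. Using one union crop keeps the spatial size constant across frames, at the cost of including more background when the people move a lot.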

Paper Structure

This paper contains 4 sections, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Data Augmentation
  • Figure 2: Framework for tube extraction and its related applications
  • Figure 3: Extracted Fight Tube