DIFEM: Key-points Interaction based Feature Extraction Module for Violence Recognition in Videos
Himanshu Mittal, Suvramalya Basak, Anjali Gautam
TL;DR
This work tackles violence recognition in surveillance video by proposing DIFEM, a lightweight feature extractor that uses OpenPose skeleton key-points to capture temporal joint motion (velocity) and inter-person spatial proximity (joint overlap). The DIFEM features are fed to conventional classifiers (k-Nearest Neighbor, AdaBoost, Decision Tree, Random Forest), achieving competitive accuracy with far fewer parameters than deep learning methods. Across three standard datasets (RWF-2000, Hockey Fight, Crowd Violence), DIFEM demonstrates strong performance and robustness, and ablation studies confirm the value of combining velocity and overlap. The approach offers a practical, real-time alternative for violence detection in surveillance systems, balancing accuracy and computational efficiency.
Abstract
Violence detection in surveillance videos is a critical task for ensuring public safety. As a result, there is an increasing need for efficient and lightweight systems that automatically detect violent behaviour. In this work, we propose an effective method that leverages human skeleton key-points to capture inherent properties of violence, such as the rapid movement of specific joints and their close proximity. At the heart of our method is our novel Dynamic Interaction Feature Extraction Module (DIFEM), which extracts features such as joint velocity and joint intersections to capture the dynamics of violent behaviour. The features extracted by DIFEM are passed to various classification algorithms: Random Forest, Decision Tree, AdaBoost, and k-Nearest Neighbor. Our approach requires substantially fewer parameters than existing state-of-the-art (SOTA) methods that employ deep learning techniques. We perform extensive experiments on three standard violence recognition datasets and observe promising performance on all three. Our proposed method surpasses several SOTA violence recognition methods.
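To make the two feature types concrete, the following is a minimal sketch of how joint velocity and inter-person joint overlap might be computed from skeleton key-points. It is not the authors' implementation: the data layout (a clip as a list of frames, a frame as a list of people, a person as a list of (x, y) joints), the function names, the distance threshold `radius`, and the averaging into a two-value feature vector are all assumptions made for illustration. It also assumes that each person keeps the same index from frame to frame.

```python
import math
from itertools import combinations

# Assumed layout: a frame is a list of people; a person is a list of (x, y)
# joint coordinates. This is an illustrative layout, not the OpenPose format.

def joint_velocities(prev_person, curr_person, dt=1.0):
    """Per-joint speed between two consecutive frames of the same person."""
    return [math.dist(p, c) / dt for p, c in zip(prev_person, curr_person)]

def overlap_count(frame, radius):
    """Count joint pairs from two different people lying within `radius`."""
    count = 0
    for person_a, person_b in combinations(frame, 2):
        for joint_a in person_a:
            for joint_b in person_b:
                if math.dist(joint_a, joint_b) <= radius:
                    count += 1
    return count

def clip_features(frames, radius=10.0):
    """Return [mean joint speed, mean overlap count] for one clip.

    Assumes people are listed in the same order in every frame.
    """
    speeds = []
    for prev, curr in zip(frames, frames[1:]):
        for p_prev, p_curr in zip(prev, curr):
            speeds.extend(joint_velocities(p_prev, p_curr))
    overlaps = [overlap_count(frame, radius) for frame in frames]
    mean_speed = sum(speeds) / len(speeds) if speeds else 0.0
    mean_overlap = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return [mean_speed, mean_overlap]

# Two people with two joints each, over two frames (hypothetical values).
frame0 = [[(0, 0), (0, 10)], [(8, 0), (50, 10)]]
frame1 = [[(3, 4), (0, 10)], [(8, 0), (50, 10)]]
features = clip_features([frame0, frame1])
```

A feature vector of this kind, computed per clip, could then be passed to any of the conventional classifiers named in the abstract. Averaging over the whole clip is only one possible aggregation; per-joint or per-frame statistics would preserve more detail.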
