DIFEM: Key-points Interaction based Feature Extraction Module for Violence Recognition in Videos

Himanshu Mittal, Suvramalya Basak, Anjali Gautam

TL;DR

This work tackles violence recognition in surveillance by proposing DIFEM, a lightweight feature extractor that uses OpenPose skeleton key-points to capture temporal joint motion (velocity) and inter-person spatial proximity (joint overlap). The DIFEM features are fed to conventional classifiers (Nearest Neighbor, AdaBoost, Decision Tree, Random Forest) to achieve competitive accuracy with far fewer parameters than deep learning methods. Across three standard datasets (RWF-2000, Hockey Fight, Crowd Violence), DIFEM demonstrates strong performance and robustness, with ablation studies confirming the value of combining velocity and overlap. The approach offers a practical, real-time alternative for violence detection in surveillance systems, balancing accuracy and computational efficiency.

Abstract

Violence detection in surveillance videos is a critical task for ensuring public safety. As a result, there is an increasing need for efficient and lightweight systems for automatic detection of violent behaviours. In this work, we propose an effective method which leverages human skeleton key-points to capture inherent properties of violence, such as rapid movement of specific joints and their close proximity. At the heart of our method is our novel Dynamic Interaction Feature Extraction Module (DIFEM), which extracts features such as joint velocity and joint intersections, effectively capturing the dynamics of violent behavior. The features extracted by DIFEM are classified using various algorithms such as Random Forest, Decision Tree, AdaBoost and k-Nearest Neighbor. Our approach requires substantially fewer parameters than existing state-of-the-art (SOTA) methods that employ deep learning techniques. We perform extensive experiments on three standard violence recognition datasets and show promising performance on all three. Our proposed method surpasses several SOTA violence recognition methods.

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of our proposed approach. The pose key-points of human objects are first extracted using the OpenPose algorithm [cao2017realtime]. From these key-point coordinates, temporal and spatial features are extracted, as explained in the DIFEM section. These features, after concatenation, are provided to different classifiers.
  • Figure 2: Selected pose key-points. Out of the 25 OpenPose key-points, we have selected the 11 that are most likely to have significant involvement in physical confrontations. These key-points are marked with a red circle.
  • Figure 3: Visualization of the key-point velocity calculation. The skeleton in blue denotes the pose key-points at time $t$, red denotes the key-points at time $t+1$, and green denotes the key-points at time $t+2$. The velocity metric measures how far a particular key-point $(x_{i,t},y_{j,t})$ at time $t$ has moved by time $t+1$, where its position is denoted by $(x_{i,t+1},y_{j,t+1})$. That is, the temporal dynamics measure the distance travelled by every key-point between consecutive frames.
  • Figure 4: Visualization of the key-point overlap measure. It is the number of key-points of one person that fall inside the bounding box of another person. This gives an estimate of the proximity of certain human joints in fighting actions. (a) The number of overlapping joints (marked in red) is 3; (b) the number of overlapping joints is 4.
  • Figure 5: Velocity and overlap measures averaged over all videos in the test set of the RWF-2000 dataset. "Fight" videos, on average, have higher velocity and joint overlap counts than "Non-Fight" videos.
  • ...and 1 more figure
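
The two measures described in the captions of Figures 3 and 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the data layout (a pose as a list of `(x, y)` tuples), the function names, the use of Euclidean distance for the velocity, and the choice to derive each person's bounding box from that person's own key-points are all assumptions made here for the example.

```python
from math import hypot

# Assumed layout: a pose is a list of (x, y) key-point coordinates,
# one entry per selected joint, for one person in one frame.

def keypoint_velocity(pose_t, pose_t1):
    """Distance each key-point moves between two consecutive frames.

    Euclidean distance is an assumption of this sketch.
    """
    return [hypot(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(pose_t, pose_t1)]

def bounding_box(pose):
    """Axis-aligned box around a pose (assumed to come from its key-points)."""
    xs = [x for x, _ in pose]
    ys = [y for _, y in pose]
    return min(xs), min(ys), max(xs), max(ys)

def overlap_count(pose_a, pose_b):
    """Number of key-points of person A lying inside person B's bounding box."""
    x_min, y_min, x_max, y_max = bounding_box(pose_b)
    return sum(1 for x, y in pose_a
               if x_min <= x <= x_max and y_min <= y <= y_max)

def frame_features(pose_a_t, pose_a_t1, pose_b_t1):
    """Concatenate the temporal (velocity) and spatial (overlap) features."""
    return keypoint_velocity(pose_a_t, pose_a_t1) + [overlap_count(pose_a_t1, pose_b_t1)]

# Toy example with three key-points for person A and two for person B.
person_a_t = [(0, 0), (3, 4), (10, 10)]
person_a_t1 = [(3, 4), (3, 4), (10, 13)]
person_b_t1 = [(2, 2), (6, 8)]

velocities = keypoint_velocity(person_a_t, person_a_t1)
overlap = overlap_count(person_a_t1, person_b_t1)
features = frame_features(person_a_t, person_a_t1, person_b_t1)
```

In this toy example the first key-point moves a distance of 5, the second does not move, and two of person A's key-points end up inside person B's box. A feature vector of this kind, one per frame pair, is the sort of input that could then be passed to a conventional classifier such as a Random Forest.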