A Computer Vision Based Approach for Stalking Detection Using a CNN-LSTM-MLP Hybrid Fusion Model

Murad Hasan, Shahriar Iqbal, Md. Billal Hossain Faisal, Md. Musnad Hossin Neloy, Md. Tonmoy Kabir, Md. Tanzim Reza, Md. Golam Rabiul Alam, Md Zia Uddin

TL;DR

The paper tackles physical stalking detection in public spaces using video analysis. It introduces a CNN-LSTM-MLP hybrid fusion model that merges ConvLSTM-based spatiotemporal features with the output of an MLP that processes numerical facial features (facial landmarks, head pose, and relative distance), classifying a clip as stalking or non-stalking from a small number of frames. The authors also build a new dataset of 238 videos (117 stalking, 121 non-stalking) sourced from feature films and TV, with clips trimmed to 3–8 seconds and annotated by five human raters. The proposed fusion approach reaches 89.58% testing accuracy, outperforming CNN- and ConvLSTM-based baselines, which suggests that adding facial-feature cues helps video-based stalking detection in potential surveillance applications.

Abstract

Criminal and suspicious activity detection has become a popular research topic in recent years. The rapid growth of computer vision technologies has had a crucial impact on solving this issue. However, physical stalking detection is still a less explored area despite the evolution of modern technology. Nowadays, stalking in public places has become a common occurrence with women being the most affected. Stalking is a visible action that usually occurs before any criminal activity begins as the stalker begins to follow, loiter, and stare at the victim before committing any criminal activity such as assault, kidnapping, rape, and so on. Therefore, it has become a necessity to detect stalking as all of these criminal activities can be stopped in the first place through stalking detection. In this research, we propose a novel deep learning-based hybrid fusion model to detect potential stalkers from a single video with a minimal number of frames. We extract multiple relevant features, such as facial landmarks, head pose estimation, and relative distance, as numerical values from video frames. This data is fed into a multilayer perceptron (MLP) to perform a classification task between a stalking and a non-stalking scenario. Simultaneously, the video frames are fed into a combination of convolutional and LSTM models to extract the spatio-temporal features. We use a fusion of these numerical and spatio-temporal features to build a classifier to detect stalking incidents. Additionally, we introduce a dataset consisting of stalking and non-stalking videos gathered from various feature films and television series, which is also used to train the model. The experimental results show the efficiency and dynamism of our proposed stalker detection system, achieving 89.58% testing accuracy with a significant improvement as compared to the state-of-the-art approaches.
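
The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes feature-level fusion by simple concatenation followed by a single sigmoid output unit. The abstract does not state the actual fusion operator, layer sizes, or training setup, so every name and dimension below (`dense`, `fuse_and_classify`, the 8- and 6-element vectors) is hypothetical, and the random video vector only stands in for a real ConvLSTM embedding.

```python
import math
import random


def dense(x, weights, bias):
    """One fully connected layer with ReLU (stand-in for the MLP branch)."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def fuse_and_classify(video_feat, facial_feat, w_out, b_out):
    """Concatenate both branch embeddings and apply a sigmoid output unit.

    Concatenation is an assumed fusion operator, not one confirmed by the paper.
    """
    fused = video_feat + facial_feat  # list concatenation = feature-level fusion
    logit = sum(w * v for w, v in zip(w_out, fused)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)

# Placeholder for the spatio-temporal embedding a ConvLSTM branch would produce.
video_feat = [rng.random() for _ in range(8)]

# Placeholder numerical features (e.g. landmark coordinates, head pose angles,
# relative distance); the values here are random, not real measurements.
facial_raw = [rng.random() for _ in range(6)]

# Untrained random weights: a 6 -> 4 hidden layer and a 12 -> 1 output unit.
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(4)]
b_hidden = [0.0] * 4
w_out = [rng.uniform(-0.5, 0.5) for _ in range(12)]

facial_feat = dense(facial_raw, w_hidden, b_hidden)
p_stalking = fuse_and_classify(video_feat, facial_feat, w_out, 0.0)
print(f"P(stalking) from untrained weights: {p_stalking:.3f}")
```

Because the weights are untrained, the printed probability carries no meaning; the sketch only shows how the two feature streams meet in one classifier input.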

Paper Structure

This paper contains 21 sections, 12 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Top-level view of the proposed model.
  • Figure 2: Comparison of video frames with and without backgrounds.
  • Figure 3: An illustration of facial landmark points (a) and their corresponding output after the selection of a limited number of points (b).
  • Figure 4: The output of head pose angles demonstrated by drawing lines on images.
  • Figure 5: Comparison of relative distance between non-stalking and stalking cases.
  • ...and 5 more figures