Enhancing Human Action Recognition and Violence Detection Through Deep Learning Audiovisual Fusion
Pooya Janani, Amirabolfazl Suratgar, Afshin Taghvaeipour
TL;DR
This work tackles violence detection and human action recognition in public spaces by proposing a hybrid fusion-based deep learning (HFBDL) framework that combines audio and video modalities. Using pretrained audio (VGGish) and video (I3D) models, the authors compare early, intermediate, late, and hybrid fusion strategies, ultimately selecting HFBDL, which integrates high-level features with modality-specific outputs. The study expands and augments the RLVS dataset to 300 violent and 300 non-violent samples and records a separate real-world test set of 54 videos, reporting 96.29% accuracy (52 of 54 videos) on this unseen data and gains over unimodal and single-fusion baselines. The approach demonstrates practical viability for security applications, such as deploying an interactive robot for public-space monitoring, and the authors point to attention mechanisms as a future way to improve robustness.
Abstract
This paper proposes a hybrid fusion-based deep learning approach that combines two modalities, audio and video, to improve human activity recognition and violence detection in public places. To take advantage of audiovisual fusion, late fusion, intermediate fusion, and hybrid fusion-based deep learning (HFBDL) are implemented and compared. Since the objective is to detect and recognize human violence in public places, the Real Life Violence Situations (RLVS) dataset is expanded and used. Simulation results show that HFBDL achieves 96.67% accuracy on validation data, which is higher than other state-of-the-art methods on this dataset. To demonstrate the model's ability in real-world scenarios, a second dataset of 54 videos with audio, covering both violent and non-violent situations, was recorded. The model correctly classified 52 of the 54 videos. The proposed method thus shows promising performance in real scenarios and can be used for human action recognition and violence detection in public places for security purposes.
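The difference between the fusion strategies compared here can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the HFBDL layers, so simple logistic heads stand in for trained networks, short hand-written vectors stand in for VGGish and I3D embeddings, and all function names and weights below are hypothetical. The sketch assumes only that late fusion combines per-modality decisions, intermediate fusion classifies concatenated features, and a hybrid scheme combines the fused-feature output with the modality-specific outputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_prob(features, weights, bias=0.0):
    """Toy classifier head: logistic regression over a feature vector."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)


def late_fusion(p_audio, p_video):
    """Decision-level fusion: average the two unimodal probabilities."""
    return (p_audio + p_video) / 2.0


def intermediate_fusion(audio_feat, video_feat, joint_weights, bias=0.0):
    """Feature-level fusion: concatenate embeddings, then classify jointly."""
    return linear_prob(audio_feat + video_feat, joint_weights, bias)


def hybrid_fusion(audio_feat, video_feat, w_audio, w_video, w_joint, w_meta):
    """Hybrid (assumed form): a small meta-classifier over the two
    modality-specific outputs and the fused-feature output."""
    p_audio = linear_prob(audio_feat, w_audio)
    p_video = linear_prob(video_feat, w_video)
    p_joint = intermediate_fusion(audio_feat, video_feat, w_joint)
    return linear_prob([p_audio, p_video, p_joint], w_meta)


# Hand-picked toy embeddings and weights (illustrative values only).
audio_feat = [0.9, 0.1]
video_feat = [0.8, 0.3, 0.5]
p_violent = hybrid_fusion(
    audio_feat,
    video_feat,
    w_audio=[1.5, -0.5],
    w_video=[1.0, 0.5, -0.2],
    w_joint=[1.0, -0.3, 0.8, 0.4, 0.1],
    w_meta=[1.0, 1.0, 1.0],
)
print(f"toy hybrid-fusion probability of 'violent': {p_violent:.3f}")
```

In a real system each head would be a trained network and the meta-classifier weights would be learned rather than fixed; the sketch only shows where each strategy merges the two modalities.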
