Application of Attention Mechanism with Bidirectional Long Short-Term Memory (BiLSTM) and CNN for Human Conflict Detection using Computer Vision
Erick da Silva Farias, Eduardo Palhares Junior
TL;DR
This work addresses automatic violence detection in surveillance videos by integrating CNN-based spatial feature extraction with BiLSTM temporal modeling and an attention mechanism that emphasizes informative frames. The proposed architecture processes 15-frame sequences through a TimeDistributed CNN, a BiLSTM, and a simple weighted-sum attention layer before final dense classification, and is evaluated with three CNN backbones (MobileNetV2, DenseNet121, InceptionV3). MobileNetV2 achieves the best accuracy (up to 96.50%) with a low learning rate and a small batch size. Attention yields mixed benefits depending on the hyperparameters and does not increase training time. The study highlights the importance of hyperparameter tuning and suggests multimodal data as a promising avenue for more robust real-time conflict detection in video surveillance.
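The weighted-sum attention step mentioned above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the scoring function (a dot product with a score vector, followed by a softmax) is an assumption, since this section does not specify it, and the names `attention_pool`, `score_weights`, and `frames` are hypothetical. In the full model the per-frame vectors would be BiLSTM outputs; here they are toy values.

```python
import math

def attention_pool(frame_features, score_weights):
    """Collapse a sequence of per-frame feature vectors into one vector.

    Assumed scoring: dot product of each frame vector with `score_weights`,
    normalized over the sequence with a softmax.
    """
    # One scalar relevance score per frame.
    scores = [sum(w * x for w, x in zip(score_weights, frame))
              for frame in frame_features]
    # Softmax over frames, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the frame vectors, dimension by dimension.
    dim = len(frame_features[0])
    context = [sum(a * frame[d] for a, frame in zip(weights, frame_features))
               for d in range(dim)]
    return context, weights

# Toy input: 15 frames with 4 features each (illustrative values only).
frames = [[0.1 * t, 1.0, 0.0, -0.1 * t] for t in range(15)]
score_weights = [1.0, 0.0, 0.0, 0.0]  # hypothetical stand-in for learned weights

context, weights = attention_pool(frames, score_weights)
```

The pooled `context` vector has the same dimensionality as a single frame vector and is what a dense classifier would consume. Frames with higher scores contribute more to it, which is the sense in which attention "emphasizes informative frames".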
Abstract
The automatic detection of human conflicts in videos is an important area of computer vision, with significant applications in monitoring and public safety policies. However, the scarcity of public datasets and the complexity of human interactions make this task challenging. This study investigates the integration of deep learning techniques, namely an attention mechanism, Convolutional Neural Networks (CNNs), and Bidirectional Long Short-Term Memory (BiLSTM), to improve the detection of violent behavior in videos. The research explores how the attention mechanism can help the model focus on the most relevant parts of the video, enhancing its accuracy and robustness. The experiments indicate that combining CNNs with BiLSTM and the attention mechanism is a promising solution for conflict monitoring, and they offer insights into the effectiveness of different strategies. This work opens new possibilities for developing automated surveillance systems that detect violent events more efficiently in real time.
