Multimodal Attention-Enhanced Feature Fusion-based Weekly Supervised Anomaly Violence Detection

Yuta Kaneko; Abu Saleh Musa Miah; Najmul Hassan; Hyoun-Sup Lee; Si-Woong Jang; Jungpil Shin

TL;DR

The paper tackles weakly supervised video anomaly detection (WS-VAD) with a multimodal fusion framework that combines RGB, optical flow, and audio cues. It introduces a three-component RGB stream (ViT-CLIP top-k features, I3D with Temporal Context Aggregation, and UR-DMU with memory-augmented attention), a flow stream (I3D+MLP+Transformer), and an audio stream (VGGish+Transformer), all fused through gated attention and trained with MIL-based objectives. Key contributions include the integration of CLIP-based features with long-range temporal modeling, a memory-augmented UR-DMU module with dual losses and Normal Data Uncertainty Learning, and an experimental evaluation showing state-of-the-art performance on XD-Violence and strong results on ShanghaiTech and UCF-Crime. The results indicate that cross-modal temporal context and memory-augmented representations improve the robustness of anomaly detection, which matters for real-world surveillance systems that need accurate, scalable detection from weak labels. Overall, the work advances WS-VAD by combining multimodal cues with temporal reasoning to surpass prior approaches in both accuracy and robustness.
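To make the idea of gated fusion concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function `gated_fusion`, the scalar sigmoid gate per modality, and the fixed modality order are all assumptions made for illustration, since this summary does not specify how the gates are computed.

```python
import math

# Hypothetical modality order used when concatenating the gated features.
MODALITIES = ("rgb", "flow", "audio")


def sigmoid(x):
    """Logistic function mapping a real score to a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(features, weights, biases):
    """Scale each modality by a learned scalar gate, then concatenate.

    features: dict mapping modality name -> list of floats
    weights:  dict mapping modality name -> list of floats (same length)
    biases:   dict mapping modality name -> float
    Assumption: one scalar gate per modality, g = sigmoid(w . x + b).
    """
    fused = []
    for name in MODALITIES:
        x = features[name]
        score = sum(w * v for w, v in zip(weights[name], x)) + biases[name]
        gate = sigmoid(score)
        # A gate near 0 suppresses the modality; near 1 passes it through.
        fused.extend(gate * v for v in x)
    return fused
```

With all gate weights and biases set to zero, every gate equals 0.5 and each feature is simply halved; in a trained model the gates would instead reflect how informative each modality is for the current snippet.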

Abstract

Weakly supervised video anomaly detection (WS-VAD) is a crucial area in computer vision for developing intelligent surveillance systems. The proposed system uses three feature streams: RGB video, optical flow, and audio signals. Each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the first stream, we employ an attention-based, multi-stage feature enhancement approach to improve the spatial and temporal features extracted from the RGB video. The first stage consists of a ViT-based CLIP module whose top-k features are concatenated in parallel with rich spatiotemporal features from I3D and Temporal Contextual Aggregation (TCA). The second stage captures temporal dependencies using the Uncertainty-Regulated Dual Memory Units (UR-DMU) model, which learns representations of normal and abnormal data simultaneously, and the third stage selects the most relevant spatiotemporal features. The second stream extracts attention-enhanced spatiotemporal features from the optical flow modality by integrating a deep learning model with an attention module. The audio stream captures auditory cues using an attention module integrated with the VGGish model, aiming to detect anomalies based on sound patterns. The flow and audio streams enrich the model with motion and audio signals that often indicate abnormal events undetectable through visual analysis alone. Fusing the multimodal features by concatenation leverages the strengths of each modality, yielding a comprehensive feature set that significantly improves anomaly detection accuracy and robustness across three datasets. Extensive experiments on the three benchmark datasets demonstrate the effectiveness of the proposed system over existing state-of-the-art systems.
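In the weakly supervised setting, only video-level labels are available, so snippet-level anomaly scores have to be aggregated before a loss can be computed. The sketch below shows one common multiple-instance learning (MIL) formulation: average the top-k snippet scores and apply binary cross-entropy against the video label. The abstract does not give the exact objective, so the function names, the choice of k, and the use of binary cross-entropy are assumptions for illustration only.

```python
import math


def topk_video_score(snippet_scores, k):
    """Average the k highest snippet scores to get a video-level score.

    Assumption: snippet_scores are probabilities in [0, 1], one per snippet.
    """
    k = max(1, min(k, len(snippet_scores)))
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / k


def video_level_bce(snippet_scores, label, k=2, eps=1e-7):
    """Binary cross-entropy between the top-k video score and a 0/1 label."""
    p = topk_video_score(snippet_scores, k)
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Usage: a video whose two most suspicious snippets score 0.9 and 0.7.
scores = [0.1, 0.9, 0.7, 0.2]
loss_if_anomalous = video_level_bce(scores, label=1)
loss_if_normal = video_level_bce(scores, label=0)
```

Because only the highest-scoring snippets contribute, a video labeled anomalous pushes a few snippet scores up, while a video labeled normal pushes even its most suspicious snippets down.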
