Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection
Ayush Ghadiya, Purbayan Kar, Vishal Chudasama, Pankaj Wasnik
TL;DR
The paper addresses weakly supervised video anomaly detection (WS-VAD) for violence and nudity by jointly modeling audio and visual cues. It introduces a Cross-modal Fusion Adapter (CFA) to adaptively fuse the modalities and a Hyperbolic Lorentzian Graph Attention (HLGAtt) module to learn hierarchical distinctions between normal and abnormal representations in hyperbolic space, trained with a multiple-instance learning (MIL) objective. Empirically, the approach achieves state-of-the-art results on XD-Violence (AP $=86.34\%$) and NPDI Nudity (AP $=99.45\%$), with ablations confirming the value of the CFA and HLGAtt components and of the prefix-tuning design. The work advances robust, low-label WS-VAD for real-world content moderation by better handling modality imbalance and improving the separation of normal and abnormal features.
Abstract
Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as an active research direction for identifying anomalous events such as violence and nudity in videos using only video-level labels. However, the task poses substantial challenges, including handling imbalanced modality information and consistently distinguishing normal from abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism, the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce a Hyperbolic Lorentzian Graph Attention (HLGAtt) mechanism to effectively capture the hierarchical relationships between normal and abnormal representations, thereby improving feature separation. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets for violence and nudity detection.
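To make the hyperbolic setting concrete, the following is a minimal sketch of the standard Lorentz (hyperboloid) model that a Lorentzian attention mechanism would plausibly build on. It is not the paper's HLGAtt implementation: the function names are hypothetical, curvature is assumed to be $-1$, and only the textbook inner product, exponential map at the origin, and geodesic distance are shown.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v):
    """Lift a Euclidean vector v onto the unit hyperboloid (assumed curvature -1)
    using the exponential map at the hyperboloid origin (1, 0, ..., 0)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]

def lorentz_distance(x, y):
    """Geodesic distance between two points on the unit hyperboloid."""
    # Clamp to 1.0 because rounding can push -<x, y>_L slightly below 1.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

# Hypothetical 2-D features standing in for snippet embeddings.
origin = exp_map_origin([0.0, 0.0])
p = exp_map_origin([0.3, 0.4])
q = exp_map_origin([-0.3, -0.4])
dist_origin_p = lorentz_distance(origin, p)
dist_p_q = lorentz_distance(p, q)
```

In this geometry, the distance from the origin to a lifted point equals the Euclidean norm of the original vector, and distances between points grow faster than in Euclidean space as points move away from the origin, which is the property usually cited for embedding hierarchical structure.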
