Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

Ayush Ghadiya, Purbayan Kar, Vishal Chudasama, Pankaj Wasnik

TL;DR

The paper addresses weakly supervised video anomaly detection (WS-VAD) for violence and nudity by jointly modeling audio and visual cues. It introduces a Cross-modal Fusion Adapter (CFA) to adaptively fuse the two modalities and a Hyperbolic Lorentzian Graph Attention (HLGAtt) mechanism to learn hierarchical distinctions between normal and abnormal representations in hyperbolic space; the model is trained with a multiple-instance learning (MIL) objective. Empirically, the approach achieves state-of-the-art results on XD-Violence (AP $=86.34\%$) and NPDI Nudity (AP $=99.45\%$), with ablations confirming the value of the CFA and HLGAtt components and of the prefix-tuning design. The work advances robust, low-label WS-VAD for real-world content moderation by handling modality imbalance better and separating normal from abnormal features more discriminatively.

Abstract

Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction to identify anomaly events like violence and nudity in videos using only video-level labels. However, this task has substantial challenges, including addressing imbalanced modality information and consistently distinguishing between normal and abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism known as the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce a Hyperbolic Lorentzian Graph Attention (HLGAtt) to effectively capture the hierarchical relationships between normal and abnormal representations, thereby enhancing feature separation accuracy. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets of violence and nudity detection.

Paper Structure (15 sections, 13 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Comparative analysis of our proposed method with a prior video-based method as well as audio-video fusion approaches (XDviolence, HyperVD) on test videos of the XD-Violence dataset.
  • Figure 2: Overview of the proposed framework. It takes as input audio and visual features extracted from pre-trained encoder networks. These are fused by the proposed Cross-Modal Fusion Adapter (CFA) module to learn multi-modal interaction effectively. The introduced Hyperbolic Lorentzian Graph Attention (HLGAtt) mechanism then captures hierarchical relationships between visual and audio representations, ensuring consistency in distinguishing normal and abnormal features during training. Finally, the resulting features are passed to a hyperbolic classifier to predict anomaly events for each instance.
  • Figure 3: Visual comparison of anomaly score curves on a sample video of the XD-Violence dataset. Yellow regions are the temporal ground truths of violent events.
  • Figure 4: Visual comparison of normal and violence features for the proposed method and HyperVD on the XD-Violence dataset.
  • Figure 5: Visual comparison between the proposed model and HyperVD in terms of anomaly score over time. Yellow regions are the temporal ground truths of nudity events.
  • ...and 3 more figures