Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space

Xiaogang Peng, Hao Wen, Yikai Luo, Xiao Zhou, Keyang Yu, Ping Yang, Zizhao Wu

TL;DR

This paper tackles weakly supervised audio-visual violence detection under video-level labels and argues that Euclidean feature spaces struggle to capture hierarchical semantic structure. It introduces HyperVD, a framework that learns snippet embeddings in hyperbolic space, combining a detour fusion module for multimodal fusion, two fully hyperbolic graph convolutional branches (HFSG and HTRG), and a hyperbolic classifier. On the XD-Violence benchmark, HyperVD achieves state-of-the-art AP (85.67%), and ablations confirm the contribution of detour fusion, Lorentz-based hyperbolic learning, and the two hyperbolic graph branches, all with a compact model (~0.607M parameters). The results indicate that hyperbolic geometry better encodes semantic discrepancies between violent and normal events, enabling stronger discrimination and more reliable localization in video data, while the Lorentz model keeps computation efficient and training stable.

Abstract

In recent years, the task of weakly supervised audio-visual violence detection has gained considerable attention. The goal of this task is to identify violent segments within multimodal data based on video-level labels. Despite advances in this field, traditional Euclidean neural networks, which have been used in prior research, encounter difficulties in capturing highly discriminative representations due to limitations of the feature space. To overcome this, we propose HyperVD, a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination. Our framework comprises a detour fusion module for multimodal fusion, effectively alleviating modality inconsistency between audio and visual signals. Additionally, we contribute two branches of fully hyperbolic graph convolutional networks that excavate feature similarities and temporal relationships among snippets in hyperbolic space. By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent and normal events. Extensive experiments on the XD-Violence benchmark demonstrate that our method outperforms state-of-the-art methods by a sizable margin.
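
To make "learning snippet embeddings in hyperbolic space" concrete, the following is a minimal NumPy sketch of the Lorentz (hyperboloid) model with curvature fixed at -1. It is an illustration under stated assumptions, not the authors' implementation: the function names `lorentz_inner`, `expmap0`, and `lorentz_distance` are hypothetical, and the abstract does not specify how HyperVD projects features or which curvature it uses. The sketch lifts Euclidean feature vectors onto the hyperboloid with the exponential map at the origin and measures geodesic distance between the resulting points.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi (last axis)."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def expmap0(v, eps=1e-8):
    """Exponential map at the hyperboloid origin (1, 0, ..., 0).

    v: Euclidean vectors of shape (..., d), read as tangent vectors at the origin.
    Returns points of shape (..., d + 1) satisfying <x, x>_L = -1.
    Curvature is assumed to be -1 (an assumption of this sketch).
    """
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    time = np.cosh(norm)              # time-like coordinate x0
    space = np.sinh(norm) * v / norm  # space-like coordinates x1..xd
    return np.concatenate([time, space], axis=-1)

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    # Clamp to 1 so rounding error cannot push the argument out of arccosh's domain.
    return np.arccosh(np.maximum(-lorentz_inner(x, y), 1.0))

# Toy "snippet features": 4 snippets, 8 dimensions (random placeholders, not real data).
rng = np.random.default_rng(0)
feats = 0.5 * rng.normal(size=(4, 8))
points = expmap0(feats)
origin = expmap0(np.zeros((1, 8)))
dist_to_origin = lorentz_distance(origin, points)
```

In this model the distance from the origin to an embedded point equals the Euclidean norm of the vector that was lifted, while distances between points far from the origin grow much faster than their Euclidean counterparts. That property is what makes hyperbolic space a natural fit for hierarchical structure.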

Paper Structure (23 sections, 17 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Intuitively, there are implicit hierarchical relationships and substantial semantic discrepancies between violent instances and normal instances. These discrepancies can be difficult to capture using traditional Euclidean space methods, which may not be well-suited to represent complex hierarchical structures.
  • Figure 2: Overview of our HyperVD framework. Our approach consists of four parts: detour fusion, a hyperbolic feature similarity graph branch, a hyperbolic temporal relation graph branch, and a hyperbolic classifier. Taking audio and visual features extracted from pretrained networks as inputs, we design a simple yet effective module to fuse audio-visual information. Then two hyperbolic graph branches learn instance representations via feature similarity and temporal relation in hyperbolic space. Finally, a hyperbolic classifier is deployed to predict a violence score for each instance. The entire framework is trained jointly in a weakly supervised manner, and we adopt the multiple instance learning (MIL) strategy for optimization.
  • Figure 3: Visualization of anomaly score curves. The horizontal axis represents time, and the vertical axis represents the anomaly score. The first row shows two videos containing violent events, and the second row shows normal videos. The blue curves indicate the predicted anomaly scores of the video frames, and the red areas indicate the locations of abnormal events.
  • Figure 4: Feature space visualizations of the vanilla features (left), the features trained in Euclidean space (middle), and the features trained in hyperbolic space (right). All results are obtained on the XD-Violence test set. Red dots represent non-violent features, and green dots denote violent features.
  • Figure 5: Ablative visualization of testing results on XD-Violence. The blue curves are predicted violent scores, and the "GT" bars in orange are ground truths of violent regions.
  • ...and 2 more figures
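
The multiple instance learning (MIL) strategy mentioned in the Figure 2 caption can be illustrated with a short sketch. The captions do not spell out the exact objective, so this shows one common MIL formulation and should not be read as the authors' loss: the video-level prediction is the mean of the top-k snippet scores, compared against the video-level label with binary cross-entropy. The function names and the choice of `k` are hypothetical.

```python
import numpy as np

def topk_mil_score(snippet_scores, k):
    """Aggregate per-snippet scores in [0, 1] into one video-level score.

    Uses the mean of the k highest snippet scores; k is a hypothetical
    hyperparameter chosen here for illustration only.
    """
    scores = np.asarray(snippet_scores, dtype=float)
    k = max(1, min(int(k), scores.shape[0]))  # keep k within [1, number of snippets]
    return float(np.sort(scores)[-k:].mean())

def binary_cross_entropy(p, y, eps=1e-7):
    """BCE between a predicted probability p and a video-level label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return float(-(y * np.log(p) + (1 - y) * np.log(1.0 - p)))

# Toy example: four snippet scores for one video (made-up numbers).
scores = np.array([0.1, 0.9, 0.8, 0.2])
video_score = topk_mil_score(scores, k=2)             # mean of 0.9 and 0.8
loss_if_violent = binary_cross_entropy(video_score, 1)
loss_if_normal = binary_cross_entropy(video_score, 0)
```

Because only a video-level label is available, the loss reaches the snippets through the aggregated score. Snippets that score highest in a video labeled violent are pushed further up, which is how snippet-level localization can emerge from weak supervision.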