Aligning First, Then Fusing: A Novel Weakly Supervised Multimodal Violence Detection Method

Wenping Jin, Li Zhu, Jing Sun

TL;DR

This work tackles weakly supervised multimodal violence detection by proposing an alignment-first paradigm that learns Modality-wise Feature Matching Subspaces (MFMS) to sparsely map audio and flow semantics into the RGB space. By iteratively identifying MFMSs and aligning modality features at the semantic level, the method enables more effective fusion via a simple Linear+TCN architecture and targeted losses, including MIL and a Triplet constraint. Empirical results on XD-Violence show state-of-the-art frame-level AP of $86.07\%$, with strong gains over unimodal and other multimodal baselines, and robust performance under modality dropout and varying training sizes. The approach emphasizes semantic consistency over low-level fusion tricks, offering a scalable and interpretable framework for weakly supervised multimodal video understanding in security and moderation contexts.

Abstract

Weakly supervised violence detection refers to the technique of training models to identify violent segments in videos using only video-level labels. Among these approaches, multimodal violence detection, which integrates modalities such as audio and optical flow, holds great potential. Existing methods in this domain primarily focus on designing multimodal fusion models to address modality discrepancies. In contrast, we take a different approach: we leverage the inherent discrepancies across modalities in violence event representation to propose a novel multimodal semantic feature alignment method. This method sparsely maps the semantic features of local, transient, and less informative modalities (such as audio and optical flow) into the more informative RGB semantic feature space. Through an iterative process, the method identifies a suitable non-zero feature matching subspace and aligns the modality-specific event representations based on this subspace, enabling the full exploitation of information from all modalities during the subsequent modality fusion stage. Building on this, we design a new weakly supervised violence detection framework that consists of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experimental results on benchmark datasets demonstrate the effectiveness of our method, achieving an average precision (AP) of 86.07% on the XD-Violence dataset. Our code is available at https://github.com/xjpp2016/MAVD.
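To make the weak-supervision setting concrete, the sketch below shows one common way to train from video-level labels with multiple-instance learning: pool the highest snippet scores into a single video score and apply binary cross-entropy against the video label. The top-k pooling, the value of `k`, and the function names `video_score` and `mil_bce_loss` are assumptions chosen for illustration; the abstract does not specify the exact MIL formulation the authors use.

```python
import math

def video_score(snippet_scores, k=3):
    """Pool per-snippet violence scores into one video-level score.

    Assumption: mean of the k highest scores (a common MIL pooling choice,
    not necessarily the one used in the paper).
    """
    k = max(1, min(k, len(snippet_scores)))
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / k

def mil_bce_loss(snippet_scores, video_label, k=3, eps=1e-7):
    """Binary cross-entropy between the pooled score and the video label."""
    # Clamp to avoid log(0) when a score is exactly 0 or 1.
    p = min(max(video_score(snippet_scores, k), eps), 1.0 - eps)
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1.0 - p))

# Hypothetical snippet scores for one video labeled violent (1).
scores = [0.1, 0.9, 0.8, 0.2]
loss = mil_bce_loss(scores, video_label=1, k=2)
```

Only the video-level label enters the loss, so the snippet scores that end up in the top-k are the ones that receive gradient, which is how frame-level predictions can emerge without frame-level annotation.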

Paper Structure (18 sections, 23 equations, 11 figures, 6 tables, 2 algorithms)

Figures (11)

  • Figure 1: (a) An illustration of searching for the Modality-wise Feature Matching Subspace (MFMS) and aligning features. In each iteration, we compute the pairwise similarity between audio and visual feature dimensions and select the best-matching visual feature dimensions as the MFMS. The audio features are mapped into the MFMS, forming sparse features, which are then aligned with the visual features. (b) Visualization of the MFMS convergence process. The red bars indicate whether a particular visual feature dimension is identified as part of the MFMS over 50 iterations, with darker colors indicating more frequent identification. At the beginning of the iterations, most dimensions are considered part of the MFMS; after some iterations, the MFMS converges to a small set of dimensions.
  • Figure 2: An overview of the proposed framework. It includes three stages: 1. Unimodal MIL: this stage trains the encoder for each modality using the MIL loss, with the aim of extracting the semantic features most relevant to the VD task. 2. Multimodal alignment: in this stage, our proposed method searches for the MFMSs and aligns the semantic features of different modalities based on the identified MFMSs. 3. Multimodal fusion and final VD: this stage uses a multimodal encoder to fuse the aligned modality features and trains the model using both the MIL loss and a Triplet Loss designed specifically for the VD task.
  • Figure 3: By searching for MFMSs, the entire RGB feature space is divided into four distinct parts: RGB-Audio-Flow MFMS, RGB-Audio MFMS, RGB-Flow MFMS, and pure RGB.
  • Figure 4: Overview of our inference process. The process comprises two stages. Stage 1: feature extraction for each modality. Stage 2: fusion of the multimodal features and calculation of the violence score.
  • Figure 5: Performance across different audio and flow dimensions. The AP ($\%$) values are visualized as a heatmap, where deeper blue represents higher AP values.
  • ...and 6 more figures
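
The subspace search described in the Figure 1 caption can be sketched roughly as follows. This is a single, simplified iteration under stated assumptions, not the authors' implementation: similarity is taken to be cosine similarity between feature dimensions computed over time, each audio dimension is matched to exactly one RGB dimension, and the helper names `find_mfms` and `map_to_rgb_space` are hypothetical. The paper's iterative refinement and alignment loss are not reproduced here.

```python
import math

def column(features, j):
    """Return dimension j of a T x D feature matrix as a list over time."""
    return [row[j] for row in features]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def find_mfms(audio, rgb):
    """For each audio dimension, pick the index of the most similar RGB dimension.

    Assumption: the set of picked indices plays the role of the MFMS.
    """
    d_a, d_v = len(audio[0]), len(rgb[0])
    match = []
    for i in range(d_a):
        a_col = column(audio, i)
        sims = [cosine(a_col, column(rgb, j)) for j in range(d_v)]
        match.append(max(range(d_v), key=lambda j: sims[j]))
    return match

def map_to_rgb_space(audio, match, d_v):
    """Scatter audio features into a sparse vector of RGB dimensionality."""
    sparse = []
    for row in audio:
        out = [0.0] * d_v  # dimensions outside the MFMS stay zero
        for i, j in enumerate(match):
            out[j] += row[i]
        sparse.append(out)
    return sparse

# Toy data: 3 time steps, 1 audio dimension, 3 RGB dimensions.
audio = [[1.0], [2.0], [3.0]]
rgb = [[3.0, 1.0, 0.0], [2.0, 2.0, 1.0], [1.0, 3.0, 0.0]]
match = find_mfms(audio, rgb)                  # audio dim 0 matches RGB dim 1
sparse_audio = map_to_rgb_space(audio, match, d_v=3)
```

In this toy case the audio dimension rises over time exactly like RGB dimension 1, so it is mapped there and the other two RGB positions remain zero, which mirrors the sparse mapping the caption describes.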