Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection

Shengyang Sun, Xiaojin Gong

TL;DR

This work tackles weakly supervised multimodal violence detection, where only video-level labels are available, across RGB, optical flow, and audio streams. It introduces a multi-scale bottleneck transformer (MSBT) fusion module and a temporal consistency contrast (TCC) loss to address information redundancy, modality imbalance, and modality asynchrony, fusing the modalities pair by pair. The architecture is fully transformer-based and is trained with a MIL objective that uses top-$K$ selection ($K=9$) together with a TCC regularizer ($\tau=0.5$). It achieves state-of-the-art AP on XD-Violence when all three modalities are used (RGB+Audio+Flow: 84.32%) and is accompanied by ablation studies. The method can be extended to additional modalities and is aimed at robust multimodal violence detection in real-world settings.
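To make the top-$K$ MIL objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the video-level score is the mean of the $K$ largest snippet-level anomaly scores and that it is compared with the video label through binary cross-entropy, a common formulation in weakly supervised anomaly detection. The function names `topk_mil_score` and `mil_bce_loss` are hypothetical; only $K=9$ comes from the summary above.

```python
import math


def topk_mil_score(snippet_scores, k=9):
    """Video-level score: mean of the k largest snippet scores (assumed pooling)."""
    if not snippet_scores:
        raise ValueError("snippet_scores must not be empty")
    k = min(k, len(snippet_scores))  # short videos may have fewer than k snippets
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / k


def mil_bce_loss(snippet_scores, video_label, k=9, eps=1e-7):
    """Binary cross-entropy between the pooled score and the video-level label.

    video_label is 1 for a violent video and 0 for a normal one.
    Snippet scores are assumed to lie in [0, 1] (e.g. sigmoid outputs).
    """
    p = topk_mil_score(snippet_scores, k)
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1.0 - p))


if __name__ == "__main__":
    scores = [0.05, 0.10, 0.92, 0.88, 0.07, 0.95, 0.12, 0.90, 0.85, 0.91, 0.93, 0.89]
    print("video score:", topk_mil_score(scores))
    print("loss if labelled violent:", mil_bce_loss(scores, 1))
    print("loss if labelled normal:", mil_bce_loss(scores, 0))
```

Because only the highest-scoring snippets contribute to the pooled score, a violent video is pushed to contain at least a few high-scoring snippets, while a normal video is pushed to keep even its highest scores low.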

Abstract

Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities such as RGB, optical flow, and audio, while only video-level annotations are available. In the pursuit of effective multimodal violence detection (MVD), information redundancy, modality imbalance, and modality asynchrony are identified as three key challenges. In this work, we propose a new weakly supervised MVD method that explicitly addresses these challenges. Specifically, we introduce a multi-scale bottleneck transformer (MSBT) based fusion module that employs a reduced number of bottleneck tokens to gradually condense information and fuse each pair of modalities and utilizes a bottleneck token-based weighting scheme to highlight more important fused features. Furthermore, we propose a temporal consistency contrast loss to semantically align pairwise fused features. Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance. Code is available at https://github.com/shengyangsun/MSBT.

Paper Structure (17 sections, 11 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: An illustration of our multimodal fusion module. It consists of a multi-scale bottleneck transformer and a bottleneck token-based weighting scheme. When a pair of modalities is input, the bottleneck tokens first condense the information of modality $\mathtt{a}$ and then transmit it to modality $\mathtt{b}$ at each layer. Moreover, the bottleneck tokens condensed at a layer are passed to the subsequent layer for further condensation. The token at the final layer can be used to measure the quantity of information transmitted and is therefore leveraged to weight the fused feature. Best viewed in color.
  • Figure 2: An overview of the proposed framework. It includes three unimodal encoders, a multimodal fusion module, and a global encoder for multimodal feature generation. Each unimodal encoder consists of a modality-specific feature extraction backbone and a linear projection layer for tokenization and a modality-shared transformer for context aggregation within one modality. The fusion module contains a multi-scale bottleneck transformer (MSBT) to fuse any pair of modalities and a sub-module to weight concatenated fused features. The global encoder, implemented by a transformer, aggregates context over all modalities. Finally, the produced multimodal features are fed into a regressor to predict anomaly scores. The entire network is learned with a multiple instance learning (MIL) loss $\mathcal{L}_{MIL}$, together with a temporal consistency contrast (TCC) loss $\mathcal{L}_{TCC}$. Best viewed in color.
  • Figure 3: Visualization of anomaly scores predicted on the XD-Violence test set. The red regions indicate ground-truth violent events and the blue lines are anomaly scores predicted by our method. Best viewed in color.
  • Figure 4: The performance evaluation with a different number of tokens at the first layer of our multi-scale bottleneck transformer. Best viewed in color.
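The information flow described in the Figure 1 caption (bottleneck tokens condense modality $\mathtt{a}$, then pass the result to modality $\mathtt{b}$, with fewer tokens at each layer) can be illustrated with a toy sketch. This is not the authors' architecture: it assumes single-head dot-product attention without learned projections, residuals, or normalization, and it shrinks the bottleneck by simply keeping the first few tokens, which is a placeholder for whatever reduction the paper actually uses. The names `attend`, `bottleneck_fuse`, and `tokens_per_layer` are hypothetical, and the bottleneck token-based weighting scheme is not shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Toy attention: each query becomes a softmax-weighted mean of the context tokens."""
    d = len(context[0])
    out = []
    for q in queries:
        logits = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(d) for c in context]
        weights = softmax(logits)
        out.append([sum(w * c[i] for w, c in zip(weights, context)) for i in range(d)])
    return out


def bottleneck_fuse(tokens_a, tokens_b, bottleneck, tokens_per_layer=(4, 2, 1)):
    """Route information from modality a to modality b through a shrinking bottleneck."""
    for n in tokens_per_layer:
        # Assumed reduction: keep the first n tokens (placeholder, not from the paper).
        bottleneck = bottleneck[:n]
        # Step 1: the bottleneck tokens condense information from modality a.
        bottleneck = attend(bottleneck, tokens_a + bottleneck)
        # Step 2: modality b reads from the bottleneck tokens only, never from a directly.
        tokens_b = attend(tokens_b, tokens_b + bottleneck)
    return tokens_b, bottleneck


if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e.g. audio tokens (toy values)
    b = [[0.5, 0.5], [0.2, 0.8]]              # e.g. RGB tokens (toy values)
    z = [[0.0, 0.0] for _ in range(4)]        # initial bottleneck tokens
    fused_b, final_tokens = bottleneck_fuse(a, b, z)
    print(len(fused_b), "fused tokens;", len(final_tokens), "final bottleneck token")
```

The point of the design is that modality $\mathtt{b}$ never attends to modality $\mathtt{a}$ directly, so everything it receives must pass through a small and shrinking set of tokens, which limits how much redundant information is carried over.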