Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

Shengyang Sun, Jiashen Hua, Junyi Feng, Xiaojin Gong

TL;DR

This work addresses the under-utilization of text in weakly supervised multimodal video anomaly detection by proposing TG-MVAD, a text-guided framework that combines a multi-stage text augmentation (MSTA) pipeline with a multi-scale bottleneck Transformer (MSBT) for fusion. MSTA uses a three-stage process based on in-context learning (ICL) to generate high-quality anomalous text samples for fine-tuning the text feature extractor, while MSBT progressively fuses the RGB, flow, audio, and text modalities through bottleneck tokens and a weighting scheme. Results on UCF-Crime and XD-Violence show state-of-the-art performance, and ablations support the value of text guidance, staged augmentation, and the MSBT fusion design. The framework also improves explainability by associating anomaly cues with the text and visual modalities, which may ease practical deployment in surveillance settings where annotation effort is limited and reliable detection is critical.

Abstract

Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
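
The bottleneck-token idea in the abstract can be illustrated with a minimal sketch. This is not the authors' MSBT: it is a single-scale simplification with one attention head and no learned projections, and the names `cross_attention` and `bottleneck_fuse`, the modality pairing, and all sizes are assumptions made for illustration. A small set of bottleneck tokens first compresses one modality, and the other modality then reads only from that compressed summary, which limits how much redundant information can cross between modalities.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention without learned projections
    # (a real Transformer layer would add them; omitted to keep the sketch short).
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values


def bottleneck_fuse(source, target, bottleneck):
    # Step 1: the k bottleneck tokens attend to the source modality,
    # compressing its T snippet tokens into k tokens.
    compressed = cross_attention(bottleneck, source)
    # Step 2: the target modality attends only to the compressed tokens,
    # so information from the source passes through the narrow bottleneck.
    update = cross_attention(target, compressed)
    return target + update, compressed


# Hypothetical sizes: T snippets, feature width d, k bottleneck tokens (k << T).
rng = np.random.default_rng(0)
T, d, k = 32, 16, 4
rgb = rng.standard_normal((T, d))
audio = rng.standard_normal((T, d))
tokens = rng.standard_normal((k, d))

fused_audio, compressed = bottleneck_fuse(rgb, audio, tokens)
print(fused_audio.shape, compressed.shape)  # (32, 16) (4, 16)
```

A multi-scale variant would presumably repeat this step with a different number of bottleneck tokens at each stage; the abstract does not give the schedule, so none is assumed here.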

Paper Structure (43 sections, 22 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: (a) Previous methods for multimodal video anomaly detection (VAD) have demonstrated improved performance over unimodal VAD; however, they have inadequately explored the text modality and lack explainability. In contrast, we propose a text-guided multimodal VAD method that not only achieves superior performance but also provides enhanced explainability. (b) Using the UCF-Crime dataset as a case study, we observe that only 12% of the text samples are classified as anomalies. This imbalance introduces bias when fine-tuning the feature extractor. To mitigate it, we propose the multi-stage text augmentation (MSTA) approach, which generates a larger number of high-quality anomaly samples.
  • Figure 2: Illustration of (a$\sim$c) the proposed multi-stage text augmentation (MSTA) for (d) extractor fine-tuning and feature extraction. (a) Stage-I: We use a large language model (LLM) to summarize all captions in the videos, obtaining labeled text samples. (b) Stage-II: Based on the summarized captions, we use ICL to generate pseudo-labels for each caption within the video. (c) Stage-III: We use the labeled samples from the previous two stages as ICL context to generate new anomalous samples. (d) We fine-tune the feature extractor on both the original and generated samples to obtain high-quality text features.
  • Figure 3: An overview of the proposed framework. It includes three unimodal encoders, a multi-scale bottleneck Transformer, and a global encoder for multimodal feature generation. Each unimodal encoder consists of a modality-specific feature extractor, a linear projection layer for tokenization, and a modality-shared Transformer for context aggregation within one modality. The multi-scale bottleneck Transformer (MSBT) fuses any pair of modalities and includes a sub-module that weights the concatenated fused features. The global encoder, implemented as a Transformer, aggregates context over all snippets. The final anomaly score combines the anomaly scores from the fused features with the anomaly probabilities from the text.
  • Figure 4: Performance with a varying number of samplings $N_S$ in Stage-II of MSTA. Best viewed in color.
  • Figure 5: Performance with a varying number of context samplings $N_R$ in Stage-II (III) of MSTA. Best viewed in color.
  • ...and 6 more figures
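
To make the imbalance noted in Figure 1(b) concrete, the following sketch computes how many extra anomalous samples would be needed to reach a chosen anomaly fraction. Only the 12% figure comes from the caption. The corpus size of 10,000 samples, the 50% target, and the function name `anomalies_to_generate` are assumptions for illustration; the captions do not state how many samples MSTA actually generates.

```python
import math


def anomalies_to_generate(n_normal, n_anomalous, target_frac):
    """Extra anomalous samples needed so anomalies make up target_frac of the set."""
    n_total = n_normal + n_anomalous
    if not n_anomalous / n_total <= target_frac < 1:
        raise ValueError("target_frac must lie between the current fraction and 1")
    # Solve (n_anomalous + x) / (n_total + x) = target_frac for x.
    x = (target_frac * n_total - n_anomalous) / (1 - target_frac)
    return math.ceil(x)


# Hypothetical corpus of 10,000 text samples, 12% of them anomalous.
extra = anomalies_to_generate(n_normal=8_800, n_anomalous=1_200, target_frac=0.5)
print(extra)  # 7600
```

Under these assumed numbers, balancing the two classes would require 7,600 generated anomalous samples, roughly six times the original anomalous count, which indicates why manual writing of such samples would be costly.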