Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance
Shengyang Sun, Jiashen Hua, Junyi Feng, Xiaojin Gong
TL;DR
This work tackles the under-utilization of text in weakly supervised multimodal video anomaly detection by proposing TG-MVAD, a text-guided framework that combines a multi-stage text augmentation (MSTA) pipeline with a multi-scale bottleneck Transformer (MSBT) for robust fusion. MSTA uses a three-stage process based on in-context learning (ICL) to generate high-quality anomalous text samples, which are then used to fine-tune a text feature extractor. MSBT progressively fuses the RGB, flow, audio, and text modalities through bottleneck tokens and a weighting scheme. Empirical results on UCF-Crime and XD-Violence show state-of-the-art performance, and ablations confirm the value of text guidance, staged augmentation, and the MSBT fusion design. The framework also improves explainability by associating anomaly cues with the text and visual modalities, which is relevant in surveillance settings where annotation effort is limited and reliable detection is critical.
Abstract
Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
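To make the idea of fusing modalities through compressed bottleneck tokens more concrete, here is a minimal sketch in plain Python. It is not the paper's MSBT module: it assumes single-head dot-product attention with no learned projections, a fixed number of bottleneck tokens, and a simple residual update, and it omits the multi-scale design and the weighting scheme. The names `attend`, `fuse_through_bottleneck`, and `random_tokens`, as well as the token counts and feature dimension, are hypothetical and chosen only for illustration.

```python
import math
import random


def attend(queries, keys_values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(
            [
                sum(w / total * kv[d] for w, kv in zip(weights, keys_values))
                for d in range(dim)
            ]
        )
    return out


def fuse_through_bottleneck(modalities, bottleneck):
    """Fold each modality, one after another, into a small set of bottleneck tokens."""
    for tokens in modalities:
        # The bottleneck tokens read from the current modality and from their own
        # state, so information gathered from earlier modalities is carried forward.
        update = attend(bottleneck, tokens + bottleneck)
        # Residual update (assumption: no normalization or feed-forward layer).
        bottleneck = [
            [b + u for b, u in zip(b_tok, u_tok)]
            for b_tok, u_tok in zip(bottleneck, update)
        ]
    return bottleneck


def random_tokens(count, dim):
    """Hypothetical stand-in for per-modality features from pretrained extractors."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
dim, num_bottleneck = 8, 2
modalities = {
    "rgb": random_tokens(16, dim),
    "flow": random_tokens(16, dim),
    "audio": random_tokens(16, dim),
    "text": random_tokens(4, dim),
}
bottleneck = random_tokens(num_bottleneck, dim)
fused = fuse_through_bottleneck(list(modalities.values()), bottleneck)
print(len(fused), len(fused[0]))  # prints "2 8"
```

Because every modality has to pass through only two tokens in this sketch, the fused representation stays the same size however many tokens each modality contributes, which is the intuition behind using a bottleneck to limit redundancy.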
