Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection
Yujiang Pu, Xiaoyu Wu, Lulu Yang, Shengjin Wang
TL;DR
The paper tackles weakly supervised video anomaly detection through efficient temporal context modeling and semantic enrichment. It introduces Temporal Context Aggregation (TCA), which reuses the similarity matrix with adaptive fusion and Dynamic Position Encoding to capture local and global context with fewer parameters, and Prompt-Enhanced Learning (PEL), which leverages knowledge-based prompts from ConceptNet and cross-modal alignment to improve fine-grained discriminability. Training combines an MIL-based cross-entropy loss with a KL-divergence alignment term between visual and prompt distributions, which encourages discriminative boundaries between anomaly sub-classes. Experiments on UCF-Crime, XD-Violence, and ShanghaiTech show competitive results with fewer parameters and lower computational cost, along with notable gains on certain anomaly sub-classes. The implementation is publicly available.
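To make the training objective described above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a top-k mean over snippet scores as the MIL surrogate, a one-hot video-level class as the alignment target, and a scalar weight `lam`; the function names, `k`, and `lam` are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def topk_mil_bce(scores, video_label, k=3):
    """Binary cross-entropy on the mean of the top-k snippet scores.

    scores: (T,) anomaly scores in [0, 1]; video_label: 0 (normal) or 1 (anomalous).
    The top-k pooling is an assumed MIL surrogate, not confirmed by the source.
    """
    topk = np.sort(scores)[-k:]
    p = float(np.clip(topk.mean(), 1e-7, 1 - 1e-7))
    return -(video_label * np.log(p) + (1 - video_label) * np.log(1 - p))

def kl_alignment(visual_logits, target_probs):
    """KL(target || predicted) over prompt classes.

    visual_logits: (C,) similarities between a visual feature and C prompt features.
    target_probs: (C,) target distribution (assumed one-hot in the usage below).
    """
    q = np.clip(softmax(visual_logits), 1e-7, 1.0)
    p = np.clip(target_probs, 1e-7, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def total_loss(scores, video_label, visual_logits, target_probs, lam=1.0, k=3):
    """MIL classification term plus a weighted alignment term."""
    mil = topk_mil_bce(scores, video_label, k)
    return mil + lam * kl_alignment(visual_logits, target_probs)

# Usage: one anomalous video with 5 snippets and 3 prompt classes.
scores = np.array([0.1, 0.2, 0.9, 0.8, 0.7])
logits = np.array([2.0, 0.5, -1.0])
target = np.array([1.0, 0.0, 0.0])
loss = total_loss(scores, 1, logits, target)
```

The binary term only separates normal from anomalous videos, while the alignment term is what pulls visual features toward the prompt of their own sub-class, which is the role the summary assigns to PEL.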
Abstract
Video anomaly detection under weak supervision presents significant challenges, particularly due to the lack of frame-level annotations during training. While prior research has utilized graph convolution networks and self-attention mechanisms alongside multiple instance learning (MIL)-based classification loss to model temporal relations and learn discriminative features, these methods often employ multi-branch architectures to capture local and global dependencies separately, resulting in increased parameters and computational costs. Moreover, the coarse-grained interclass separability provided by the binary constraint of MIL-based loss neglects the fine-grained discriminability within anomalous classes. In response, this paper introduces a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability. We present a Temporal Context Aggregation (TCA) module that captures comprehensive contextual information by reusing the similarity matrix and implementing adaptive fusion. Additionally, we propose a Prompt-Enhanced Learning (PEL) module that integrates semantic priors using knowledge-based prompts to boost the discriminative capacity of context features while ensuring separability between anomaly sub-classes. Extensive experiments validate the effectiveness of our method's components, demonstrating competitive performance with reduced parameters and computational effort on three challenging benchmarks: UCF-Crime, XD-Violence, and ShanghaiTech datasets. Notably, our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy. Our code is available at: https://github.com/yujiangpu20/PEL4VAD.
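The idea of reusing one similarity matrix for both global and local context can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions rather than the TCA module itself: a hard temporal window mask stands in for the position-based local mechanism, and a fixed scalar `alpha` stands in for the adaptive fusion weight. The names `temporal_context`, `window`, and `alpha` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; masked (-inf) entries become exactly 0."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_context(x, window=2, alpha=0.5):
    """Aggregate global and local temporal context from a single similarity matrix.

    x: (T, D) snippet features. Returns (T, D) context features.
    """
    t, d = x.shape
    sim = x @ x.T / np.sqrt(d)                 # computed once, reused by both branches

    # Global branch: every snippet attends to every other snippet.
    global_attn = softmax(sim)

    # Local branch: same similarities, restricted to snippets within `window` steps.
    idx = np.arange(t)
    dist = np.abs(idx[:, None] - idx[None, :])
    local_attn = softmax(np.where(dist <= window, sim, -np.inf))

    # Fusion: a fixed scalar here; assumed to be learned in a real model.
    return alpha * (global_attn @ x) + (1 - alpha) * (local_attn @ x)

# Usage: 6 snippets with 4-dimensional features.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
context = temporal_context(features, window=2, alpha=0.5)
```

Because both branches share `sim`, no second set of projection weights is needed for the local branch, which is consistent with the abstract's claim of fewer parameters than multi-branch designs.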
