Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection

Yujiang Pu, Xiaoyu Wu, Lulu Yang, Shengjin Wang

TL;DR

The paper tackles weakly supervised video anomaly detection by designing efficient temporal context modeling and semantic enrichment. It introduces Temporal Context Aggregation (TCA), which reuses the similarity matrix with adaptive fusion and Dynamic Position Encoding to capture local and global context with fewer parameters, and Prompt-Enhanced Learning (PEL), which leverages knowledge-based prompts from ConceptNet and cross-modal alignment to improve fine-grained discriminability. Training combines MIL-based cross-entropy with a KL-divergence alignment term between visual and prompt distributions, enabling discriminative boundaries for anomaly sub-classes. Experiments on UCF-Crime, XD-Violence, and ShanghaiTech show competitive results and notable gains in fine-grained anomaly detection, with reduced computational cost and a publicly available implementation.
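The training objective summarized above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the top-k mean pooling used for the MIL term, the direction of the KL term, the weight `lam`, and every function name here are hypothetical choices made for the example.

```python
import math

def topk_mil_bce(scores, video_label, k=3):
    """MIL-style binary cross-entropy on the mean of the top-k snippet scores.

    Assumption: top-k mean pooling turns snippet scores into one video score.
    """
    top = sorted(scores, reverse=True)[:k]
    p = sum(top) / len(top)
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1.0 - p))

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    eps = 1e-12
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def total_loss(scores, video_label, visual_logits, prompt_logits, lam=1.0, k=3):
    """Cross-entropy term plus a weighted KL alignment term.

    Assumption: the prompt distribution is the target and the visual
    distribution is the one being aligned to it.
    """
    ce = topk_mil_bce(scores, video_label, k)
    kd = kl_divergence(softmax(prompt_logits), softmax(visual_logits))
    return ce + lam * kd
```

When the visual and prompt logits agree, the alignment term vanishes and only the MIL term remains; any disagreement between the two distributions adds a positive penalty.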

Abstract

Video anomaly detection under weak supervision presents significant challenges, particularly due to the lack of frame-level annotations during training. While prior research has utilized graph convolution networks and self-attention mechanisms alongside multiple instance learning (MIL)-based classification loss to model temporal relations and learn discriminative features, these methods often employ multi-branch architectures to capture local and global dependencies separately, resulting in increased parameters and computational costs. Moreover, the coarse-grained interclass separability provided by the binary constraint of MIL-based loss neglects the fine-grained discriminability within anomalous classes. In response, this paper introduces a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability. We present a Temporal Context Aggregation (TCA) module that captures comprehensive contextual information by reusing the similarity matrix and implementing adaptive fusion. Additionally, we propose a Prompt-Enhanced Learning (PEL) module that integrates semantic priors using knowledge-based prompts to boost the discriminative capacity of context features while ensuring separability between anomaly sub-classes. Extensive experiments validate the effectiveness of our method's components, demonstrating competitive performance with reduced parameters and computational effort on three challenging benchmarks: UCF-Crime, XD-Violence, and ShanghaiTech datasets. Notably, our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy. Our code is available at: https://github.com/yujiangpu20/PEL4VAD.
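The idea of reusing one similarity matrix for both local and global context, then fusing the two, can be illustrated with a small stdlib-only sketch. It rests on several assumptions that the abstract does not spell out: similarity is a scaled dot product of the raw features (no learned projections), the local branch is a hard temporal window of half-width `window`, fusion is a single scalar `alpha` (fixed here, where a real model would likely learn it), and Dynamic Position Encoding is omitted. All names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax; entries equal to -inf get zero weight."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def tca_sketch(x, window=1, alpha=0.5):
    """Fuse global and local context computed from one shared similarity matrix.

    x is a list of T snippet features, each a list of D floats.
    """
    t, d = len(x), len(x[0])
    # The similarity matrix is computed once and shared by both branches.
    xt = [list(col) for col in zip(*x)]
    sim = [[v / math.sqrt(d) for v in row] for row in matmul(x, xt)]
    # Global branch: every snippet attends to all snippets.
    global_attn = [softmax(row) for row in sim]
    # Local branch: same scores, masked outside a temporal window.
    local_attn = [
        softmax([sim[i][j] if abs(i - j) <= window else -math.inf for j in range(t)])
        for i in range(t)
    ]
    global_ctx = matmul(global_attn, x)
    local_ctx = matmul(local_attn, x)
    # Scalar fusion of the two context features (assumed form of adaptive fusion).
    return [
        [alpha * g + (1 - alpha) * l for g, l in zip(g_row, l_row)]
        for g_row, l_row in zip(global_ctx, local_ctx)
    ]
```

Because both branches read the same `sim` matrix, the only extra cost of the local branch in this sketch is the masking and a second softmax, which is the kind of saving a single-branch design aims for compared with separate local and global modules.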

Paper Structure

This paper contains 40 sections, 20 equations, 9 figures, and 14 tables.

Figures (9)

  • Figure 1: Comparison of different methods for prompt construction. (a) Hand-crafted prompt. (b) Learnable prompt. (c) Knowledge-based prompt (Ours). The context features are enhanced by cross-modal alignment with the corresponding prompt features.
  • Figure 2: Overview of the proposed framework. We first process untrimmed videos using a pre-trained I3D network to extract snippet features. The TCA module then simultaneously captures both local and global contexts. High-level representations are derived using a two-layer MLP, with the PEL module operating in the middle layer to learn fine-grained semantics for context features. Finally, a causal convolution layer functions as a classifier to predict snippet-level anomaly scores. During the training stage, the model is optimized using both cross entropy loss $\mathscr{L}_{ce}$ and KL-divergence loss $\mathscr{L}_{kd}$.
  • Figure 3: An example of concept dictionary given the anomaly class fighting. The arrows point from the head node to the tail node with relevance scores, and the colors indicate different relationships. The entire graph constitutes a concept dictionary, where the bold items are those retained after node filtering.
  • Figure 4: Visualization of cosine similarity of context features. (a) and (b) are videos from the UCF-Crime dataset. (c) and (d) are videos from the XD-Violence dataset.
  • Figure 5: Contribution of the PEL module to fine-grained anomaly detection.
  • ...and 4 more figures
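
The Figure 2 caption mentions a causal convolution layer used as the classifier for snippet-level anomaly scores. The sketch below shows what "causal" means in that setting: the score at snippet t depends only on snippets up to t. It is an illustration under assumptions, not the paper's layer: the input is reduced to one scalar feature per snippet, the kernel weights are supplied by hand, a sigmoid maps outputs to scores, and the function names are hypothetical.

```python
import math

def causal_conv1d(x, kernel, bias=0.0):
    """1D convolution where output t sees only inputs at positions <= t.

    The sequence is left-padded with len(kernel) - 1 zeros, so the last
    kernel weight multiplies the current snippet and earlier weights
    multiply earlier snippets.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        bias + sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(x))
    ]

def snippet_scores(features, kernel, bias=0.0):
    """Map causal-convolution outputs to anomaly scores in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in causal_conv1d(features, kernel, bias)]
```

A quick way to check causality is to change a later snippet's feature and confirm that the scores of all earlier snippets stay the same, which is the property that would let such a classifier run on streaming video.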