Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang; Jing Liu; Peng Wu

TL;DR

The paper tackles weakly supervised video anomaly detection by leveraging textual event descriptions through CLIP to generate fine-grained pseudo-labels. It introduces TPWNG, which fine-tunes CLIP via ranking losses $L_{rank}^n$, $L_{rank}^a$ and a distributional inconsistency loss $L_{dil}$, augmented with a learnable text prompt and a normality visual prompt to improve text-video alignment. A Pseudo Label Generation (PLG) module guided by normality and a Temporal Context Self-Adaptive Learning (TCSAL) module jointly enable flexible temporal modeling and more accurate frame-level anomaly labeling, followed by end-to-end self-training. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance, highlighting the practical impact of cross-modal prompts and adaptive temporal context for robust WSVAD.

Abstract

Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels from weak labels and then self-training a classifier is currently a promising solution. However, existing methods use only the RGB visual modality and neglect category text information, which limits the accuracy of the generated pseudo-labels and degrades self-training performance. Inspired by the manual labeling process based on event descriptions, we propose a novel pseudo-label generation and self-training framework for WSVAD based on Text Prompt with Normality Guidance (TPWNG). Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model to align video event description text with the corresponding video frames and thereby generate pseudo-labels. Specifically, we first fine-tune CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. We then propose a learnable text prompt mechanism, assisted by a normality visual prompt, to further improve the matching accuracy between video event description text and video frames. Next, we design a pseudo-label generation module based on normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence.
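The abstract does not state the scoring formula, so the following is only an illustrative sketch of how frame-level pseudo-labels might be derived from text-frame similarity with a normality term. The function `pseudo_labels`, the blending rule, the weight `alpha`, the min-max normalization, and the `threshold` are all assumptions made for illustration and are not taken from the paper; the feature vectors stand in for CLIP image and text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pseudo_labels(frame_feats, abnormal_text, normal_text,
                  alpha=0.2, threshold=0.5):
    """Hypothetical pseudo-label rule (an assumption, not the paper's formula).

    Each frame is scored by its similarity to the abnormal event text,
    blended with how poorly it matches the normal text. Scores are then
    min-max normalized over the video and thresholded into 0/1 labels.
    """
    raw = []
    for feat in frame_feats:
        # Map cosine similarity from [-1, 1] to [0, 1].
        s_abn = (cosine(feat, abnormal_text) + 1.0) / 2.0
        s_nor = (cosine(feat, normal_text) + 1.0) / 2.0
        # Assumed normality guidance: a strong normal match lowers the score.
        raw.append((1.0 - alpha) * s_abn + alpha * (1.0 - s_nor))

    lo, hi = min(raw), max(raw)
    span = hi - lo
    # Guard against a constant score sequence (no usable contrast).
    scores = [0.0 if span == 0 else (r - lo) / span for r in raw]
    labels = [1 if s >= threshold else 0 for s in scores]
    return scores, labels


if __name__ == "__main__":
    # Toy 2-D "embeddings": axis 0 stands for normal, axis 1 for abnormal.
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    scores, labels = pseudo_labels(frames, abnormal_text=[0.0, 1.0],
                                   normal_text=[1.0, 0.0])
    print(labels)
```

In this toy setup only the second frame points toward the abnormal text direction, so it is the only one labeled anomalous. In practice the embeddings would come from a CLIP encoder, and the pseudo-labels would supervise a separate classifier during self-training.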

Paper Structure (28 sections, 14 equations, 8 figures, 7 tables)

Introduction
Related Work
Video Anomaly Detection
Large Vision-Language Models
Method
Overall Architecture
Text and Normality Visual Prompt
Pseudo Label Generation Module
Temporal Context Self-adaptive Learning
Objective Function
Experiments
Datasets and Evaluation Metrics
Implementation Details
Comparison with State-of-the-art Methods
Ablation Studies
...and 13 more sections

Figures (8)

Figure 1: Illustration of the manual video frame labeling process.
Figure 2: The overall architecture of our proposed TPWNG.
Figure 3: Anomaly score curves of several test samples from the UCF-Crime and XD-Violence datasets.
Figure 4: The shape of the soft mask function ${\chi _z}(h)$.
Figure 5: Change in AUC and AP of our method on the UCF-Crime and XD-Violence datasets for different values of the normality guidance weight $\alpha$.
...and 3 more figures
