Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang, Jing Liu, Peng Wu

TL;DR

The paper tackles weakly supervised video anomaly detection by leveraging textual event descriptions through CLIP to generate fine-grained pseudo-labels. It introduces TPWNG, which fine-tunes CLIP via ranking losses $L_{rank}^n$, $L_{rank}^a$ and a distributional inconsistency loss $L_{dil}$, augmented with a learnable text prompt and a normality visual prompt to improve text-video alignment. A Pseudo Label Generation (PLG) module guided by normality and a Temporal Context Self-Adaptive Learning (TCSAL) module jointly enable flexible temporal modeling and more accurate frame-level anomaly labeling, followed by end-to-end self-training. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance, highlighting the practical impact of cross-modal prompts and adaptive temporal context for robust WSVAD.

Abstract

Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels from weak labels and then self-training a classifier is currently a promising solution. However, existing methods use only the RGB visual modality and neglect category text information, which limits the accuracy of the generated pseudo-labels and degrades self-training performance. Inspired by the manual labeling process based on event descriptions, we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model to align video event description text with the corresponding video frames and thereby generate pseudo-labels. Specifically, we first fine-tune CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism, assisted by a normality visual prompt, to improve the matching accuracy between video event description text and video frames. Then, we design a pseudo-label generation module based on normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence.
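To make the pseudo-label idea concrete, the following is a minimal sketch of how frame-level labels might be derived from text-frame matching with normality guidance. It is not the authors' implementation: the function names, the softmax over class descriptions, the temperature, the min-max normalization, the threshold, and the specific way the weight `alpha` mixes the "not normal" evidence with the "matches the video's anomaly class" evidence are all assumptions made for illustration. Real features would come from CLIP's image and text encoders; here they are small hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def frame_pseudo_labels(frame_feats, text_feats, video_label,
                        alpha=0.2, temperature=0.07, threshold=0.5):
    """Hypothetical pseudo-label rule for one video with a weak (video-level) label.

    text_feats maps class names (including "normal") to text features.
    The mixing rule, normalization and threshold are illustrative assumptions.
    """
    names = list(text_feats)
    raw = []
    for f in frame_feats:
        # Match probabilities of this frame against every class description.
        probs = softmax([cosine(f, text_feats[n]) / temperature for n in names])
        p = dict(zip(names, probs))
        # Normality guidance: combine "does not look normal" with
        # "looks like the anomaly class named by the video-level label".
        raw.append(alpha * (1.0 - p["normal"]) + (1.0 - alpha) * p[video_label])

    # Min-max normalize within the video, then binarize.
    lo, hi = min(raw), max(raw)
    if hi == lo:
        scores = [0.0] * len(raw)
    else:
        scores = [(r - lo) / (hi - lo) for r in raw]
    labels = [1 if s > threshold else 0 for s in scores]
    return scores, labels


# Toy features standing in for CLIP embeddings (assumed, not real data).
text_feats = {"normal": [1.0, 0.0, 0.0],
              "fighting": [0.0, 1.0, 0.0],
              "explosion": [0.0, 0.0, 1.0]}
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
scores, labels = frame_pseudo_labels(frames, text_feats, "fighting")
print(labels)  # prints [0, 1, 0]
```

The softmax runs over all class descriptions rather than just two, so the normality term `1 - p["normal"]` and the class-match term `p[video_label]` carry different information; with only a normal/anomaly pair the two terms would coincide and `alpha` would have no effect.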

Paper Structure (28 sections, 14 equations, 8 figures, 7 tables)


Figures (8)

  • Figure 1: Illustration of the manual video frame labeling process.
  • Figure 2: The overall architecture of our proposed TPWNG.
  • Figure 3: Anomaly score curves of several test samples on the UCF-Crime and XD-Violence datasets.
  • Figure 4: The shape of the soft mask function ${\chi _z}(h)$.
  • Figure 5: The AUC and AP change of our method on the UCF-Crime and XD-Violence datasets with different normality guidance weight $\alpha$.
  • ...and 3 more figures
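
The soft mask function ${\chi _z}(h)$ named in Figure 4 is not defined in this section. As a hedged illustration only, the sketch below assumes it takes the piecewise-linear form used in adaptive attention span models, ${\chi _z}(h) = \min(\max((R + z - h)/R, 0), 1)$, where $h$ is the temporal distance between two frames, $z$ is a learnable span, and $R$ is a ramp-length hyperparameter. The names `soft_mask`, `masked_attention`, and `ramp` are hypothetical, and the paper's exact definition may differ.

```python
import math


def soft_mask(h, z, ramp):
    """Assumed soft mask: 1 for h <= z, linear decay to 0 at h = z + ramp."""
    return min(max((ramp + z - h) / ramp, 0.0), 1.0)


def masked_attention(logits, center, z, ramp):
    """Attention weights for the frame at index `center` over all frames.

    Each exponentiated logit is scaled by the soft mask of its temporal
    distance to `center`, then the weights are renormalized to sum to 1.
    Assumes z > -ramp so the center frame always keeps a nonzero weight.
    """
    m = max(logits)
    weights = [soft_mask(abs(j - center), z, ramp) * math.exp(l - m)
               for j, l in enumerate(logits)]
    total = sum(weights)
    return [w / total for w in weights]


# Uniform logits over 9 frames, span z = 1, ramp R = 2, query at frame 4.
attn = masked_attention([0.0] * 9, center=4, z=1, ramp=2)
print(attn)  # prints [0.0, 0.0, 0.125, 0.25, 0.25, 0.25, 0.125, 0.0, 0.0]
```

Under this assumed form, frames within distance $z$ of the query keep full weight, frames in the ramp region are attenuated linearly, and frames beyond $z + R$ are ignored, so a learnable $z$ lets each video event adapt how much temporal context it attends to.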