Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

Chenchen Tao, Xiaohao Peng, Chong Wang, Jiafei Wu, Puning Zhao, Jun Wang, Jiangbo Qian

TL;DR

A novel framework guides the learning of suspected anomalies from event prompts. It enables a new multi-prompt learning process that constrains the visual-semantic features across all videos, and it provides a new way to label pseudo anomalies for self-training.

Abstract

Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. However, the ambiguous nature of anomaly definitions across contexts may introduce inaccuracy in discriminating abnormal and normal events. To show the model what is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them can be calculated to identify the suspected events for each video snippet. This enables a new multi-prompt learning process that constrains the visual-semantic features across all videos, and it provides a new way to label pseudo anomalies for self-training. To demonstrate its effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets, namely XD-Violence, UCF-Crime, TAD, and ShanghaiTech. Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC (86.5%, 90.4%, 94.4%, and 97.4%). Furthermore, it shows promising performance in open-set and cross-dataset cases. The data, code, and models can be found at: https://github.com/shiwoaz/lap.
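The prompt-to-caption matching described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embedding vectors, the use of cosine similarity, the `threshold` value, and the function names `anomaly_matrix` and `suspected_events` are all assumptions made for illustration. In practice the embeddings would come from a text encoder applied to the prompt dictionary and to the snippet captions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def anomaly_matrix(prompt_embs, caption_embs):
    """Similarity of every prompt (rows) to every snippet caption (columns)."""
    return [[cosine(p, c) for c in caption_embs] for p in prompt_embs]

def suspected_events(matrix, threshold=0.5):
    """For each snippet: (best-matching prompt index, its score, pseudo label).

    The pseudo label is 1 when the best score reaches the (assumed) threshold.
    """
    results = []
    for t in range(len(matrix[0])):
        column = [row[t] for row in matrix]
        best = max(range(len(column)), key=column.__getitem__)
        results.append((best, column[best], int(column[best] >= threshold)))
    return results

# Toy embeddings (hypothetical): two prompts, three snippet captions.
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
captions = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.1, 0.8, 0.1]]

matrix = anomaly_matrix(prompts, captions)
events = suspected_events(matrix)
pseudo_labels = [label for _, _, label in events]
matched_prompts = [idx for idx, _, _ in events]
```

In this toy case the first and third snippets match a prompt closely and would be flagged as suspected anomalies, while the second snippet matches neither prompt and stays unflagged.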

Paper Structure (16 sections, 11 equations, 7 figures, 6 tables)


Figures (7)

  • Figure 1: The difference between traditional multiple instance learning methods (upper) and our model (lower). The former learns anomalies only from the top-$k$ scores in each abnormal video, while the latter uses a prompt dictionary to provide extra guidance across different videos.
  • Figure 2: The overview of the proposed LAP framework. Synthetic features, the input to the score predictors, are generated by the visual and semantic feature extractors. A prompt dictionary is used to produce the anomaly matrix and vector, which are employed to perform multi-prompt learning (MPL) and pseudo anomaly labeling (PAL) across different videos.
  • Figure 3: Visualization of the proposed anomaly matrix $\Psi^\top$. It is truncated due to the limited column width.
  • Figure 4: Qualitative comparisons of TEVAD [chen2023tevad] and our method on both UCF-Crime (UCF) and XD-Violence (XD). The ground truth of anomalous events is represented by light red regions.
  • Figure 5: The distribution of matched suspected anomalies in the UCF-Crime (upper) and TAD (lower) datasets.
  • ...and 2 more figures
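
The top-$k$ multiple instance learning baseline mentioned in the Figure 1 caption can be sketched as follows. This is a generic illustration under stated assumptions, not code from the paper: the video-level score is taken as the mean of the top-$k$ snippet scores, and the hinge-style ranking loss, the `margin`, and the function names are assumptions chosen for the example.

```python
def topk_mean(scores, k):
    """Mean of the k largest snippet scores in one video (k clipped to a valid range)."""
    k = max(1, min(k, len(scores)))
    return sum(sorted(scores, reverse=True)[:k]) / k

def mil_ranking_loss(abnormal_scores, normal_scores, k=2, margin=1.0):
    """Hinge loss pushing the abnormal video's top-k score above the normal one's.

    Only video-level labels are used; no snippet is told which anomaly it contains.
    """
    gap = topk_mean(abnormal_scores, k) - topk_mean(normal_scores, k)
    return max(0.0, margin - gap)

# Toy per-snippet anomaly scores (hypothetical values).
abnormal = [0.1, 0.9, 0.8, 0.2]
normal = [0.1, 0.2, 0.1, 0.3]

loss = mil_ranking_loss(abnormal, normal)
```

Because the loss only sees the top-$k$ scores within each video, it carries no information about which kind of event was detected, which is the gap the prompt dictionary is meant to fill.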