Anomize: Better Open Vocabulary Video Anomaly Detection

Fei Li, Wenxuan Liu, Jingjing Chen, Ruixu Zhang, Yuran Wang, Xian Zhong, Zheng Wang

TL;DR

Anomize tackles open vocabulary video anomaly detection by addressing detection ambiguity and categorization confusion through a Text-Augmented Dual Stream architecture. The dynamic stream leverages temporal features augmented with anomaly descriptions, while the static stream enriches scene-level features with a concept library; both streams are fused to produce robust frame-level anomaly scores and open-set predictions. A Group-Guided Text Encoding mechanism aligns labels by visual groups, guided by GPT-generated descriptions, improving multimodal alignment for novel anomalies. Two-stage training with targeted losses and segmented optimization yields strong performance on XD-Violence and UCF-Crime, particularly for novel categories, demonstrating practical gains for open-world safety scenarios.

Abstract

Open Vocabulary Video Anomaly Detection (OVVAD) seeks to detect and classify both base and novel anomalies. However, existing methods face two specific challenges related to novel anomalies. The first is detection ambiguity, where the model struggles to assign accurate anomaly scores to unfamiliar anomalies. The second is categorization confusion, where novel anomalies are often misclassified as visually similar base instances. To mitigate detection ambiguity, we draw on supplementary information from multiple sources, leveraging multiple levels of visual data alongside matching textual information. To reduce categorization confusion, we incorporate label relations to guide the encoding of new labels, improving the alignment between novel videos and their corresponding labels. The resulting Anomize framework addresses both issues and achieves superior performance on the UCF-Crime and XD-Violence datasets, demonstrating its effectiveness in OVVAD.

Paper Structure

This paper contains 46 sections, 24 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Challenges Related to Novel Anomalies. (a) Detection ambiguity: The model struggles to assign accurate anomaly scores to unfamiliar frames containing novel anomalies. (b) Categorization confusion: Novel anomalies are misclassified as visually similar base instances from the training set.
  • Figure 2: Feature Visualization of Our Design. (a) Text augmentation shifts ambiguous frames to the anomalous feature space. In the static stream, text represents anomaly-related nouns (e.g., "abandoned fire starter"), while in the dynamic stream, it denotes label descriptions. (b) Group-guided text encoding improves the alignment of novel anomalies with novel labels, especially for those resembling base samples.
  • Figure 3: Overview of Our Anomize Framework. (a) Process for obtaining label features via the Group-Guided Text Encoding mechanism. (b) Creation of the concept library $\mathrm{ConceptLib}$ for anomaly detection. (c) The framework processes anomaly labels and video frames to generate frame-level anomaly scores and detected labels. Scoring is performed using a Text-Augmented Dual Stream mechanism, where each stream receives corresponding text and visual features, and the fused scores are produced as output. For labeling, the model aligns label features from the Group-Guided Text Encoding mechanism with the fused original and temporal visual encodings. Both the text and image encoders, pre-trained on CLIP, remain frozen without further optimization.
  • Figure 4: Qualitative Results for Anomaly Detection. The first and second rows present results on XD-Violence and UCF-Crime respectively. Red boxes and rectangles highlight the ground-truth anomalous frames, while blue lines represent predicted anomaly scores.
  • Figure 5: Similarity Matrices of Textual Encoding. (a) and (c) depict results using encodings from the original label data, while (b) and (d) show improvements achieved with the group-guided text encoding mechanism.
  • ...and 2 more figures
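
The Figure 3 caption describes two outputs: fused frame-level anomaly scores from the two streams, and a detected label obtained by aligning label features with visual features. The following is a minimal toy sketch of those two steps, not the authors' implementation. The convex-combination fusion rule, the weight `alpha`, the cosine-similarity matching, the function names, and the two-dimensional feature vectors are all assumptions made for illustration; the text above does not specify how fusion or alignment is computed.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def fuse_scores(dynamic: Sequence[float],
                static: Sequence[float],
                alpha: float = 0.5) -> List[float]:
    """Fuse per-frame scores from the two streams.

    Assumption: a simple convex combination weighted by `alpha`.
    The actual fusion rule is not given in the text above.
    """
    if len(dynamic) != len(static):
        raise ValueError("both streams must score the same number of frames")
    return [alpha * d + (1.0 - alpha) * s for d, s in zip(dynamic, static)]


def detect_label(visual: Sequence[float],
                 label_feats: Dict[str, Sequence[float]]) -> str:
    """Return the label whose text feature is most similar to the visual feature.

    Assumption: alignment is scored with cosine similarity.
    """
    return max(label_feats, key=lambda name: cosine(visual, label_feats[name]))


# Hypothetical per-frame scores from the dynamic and static streams.
dynamic_scores = [0.1, 0.8, 0.9]
static_scores = [0.3, 0.6, 0.7]
fused = fuse_scores(dynamic_scores, static_scores)

# Hypothetical label features; real ones would come from a frozen text encoder.
label_feats = {"explosion": [1.0, 0.0], "fighting": [0.0, 1.0]}
label = detect_label([0.9, 0.2], label_feats)

print(fused, label)
```

In this toy setup, `alpha` controls how much the dynamic stream dominates the fused score. The open-vocabulary aspect corresponds to adding new entries to `label_feats` without retraining the matching step.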