AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM

Sunghyun Ahn, Youngwan Jo, Kijung Lee, Sein Kwon, Inpyo Hong, Sanghyun Park

TL;DR

The paper tackles the generalization gap in video anomaly detection (VAD) by introducing customizable VAD (C-VAD) and AnyAnomaly, a zero-shot model that uses a large vision-language model (LVLM) without fine-tuning. It employs a segment-level pipeline with a key frame selection module and context generation (Position Context and Temporal Context) to enable context-aware visual question answering (VQA) for anomaly scoring, with late fusion of LVLM outputs guided by user-defined text. Experiments on standard VAD benchmarks and newly constructed C-VAD datasets show improved performance, with state-of-the-art results on UBnormal and UCF-Crime in a zero-shot setting and strong cross-dataset generalization. The approach emphasizes real-world applicability by reducing training requirements and latency, and it offers insight into the roles of object-centric and motion-centric context in LVLM-based anomaly detection.
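The segment-level scoring and late fusion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the segment length, the equal fusion weights, and the toy `similarity` and `vqa` stand-ins are all assumptions made for illustration, and the paper's actual modules (KSM, WA, and GIG in Figure 3) are not reproduced here.

```python
from typing import Callable, List, Sequence

def fuse_scores(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Late fusion (assumed form): weighted average of per-view scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def score_segments(
    frames: Sequence[str],
    text: str,
    similarity: Callable[[str, str], float],
    vqa: Callable[[object, str, str], float],
    segment_len: int = 3,                              # hypothetical value
    weights: Sequence[float] = (1.0, 1.0, 1.0),        # hypothetical weights
) -> List[float]:
    """Assign one fused anomaly score to every frame of each segment."""
    frame_scores: List[float] = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        # Key frame: the frame most similar to the user-defined text.
        key = max(segment, key=lambda f: similarity(f, text))
        views = [
            vqa(key, text, "frame"),         # key frame on its own
            vqa(key, text, "position"),      # stand-in for position context
            vqa(segment, text, "temporal"),  # stand-in for temporal context
        ]
        fused = fuse_scores(views, weights)
        frame_scores.extend([fused] * len(segment))
    return frame_scores

# Toy stand-ins: frames are text labels instead of images (assumption).
def toy_similarity(frame: str, text: str) -> float:
    return 1.0 if text in frame else 0.0

def toy_vqa(view: object, text: str, context: str) -> float:
    items = [view] if isinstance(view, str) else list(view)
    return 1.0 if any(text in f for f in items) else 0.0

demo_frames = ["walk", "walk", "bike", "walk", "walk", "walk"]
demo_scores = score_segments(demo_frames, "bike", toy_similarity, toy_vqa)
print(demo_scores)
```

In a real system the two callables would wrap an image-text similarity model and an LVLM queried with the user-defined text; a weighted average is only one plausible way to combine the resulting scores.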

Abstract

Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes the customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD treats user-defined text as an abnormal event and detects the frames in a video that contain the specified event. We implemented AnyAnomaly using context-aware visual question answering without fine-tuning the large vision-language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive results on VAD benchmarks, achieving state-of-the-art performance on UBnormal and UCF-Crime and surpassing other methods in generalization across all datasets. Our code is available online at github.com/SkiddieAhn/Paper-AnyAnomaly.

Paper Structure

This paper contains 33 sections, 8 equations, 12 figures, and 10 tables.

Figures (12)

  • Figure 1: Comparison of traditional video anomaly detection (VAD) and customizable video anomaly detection (C-VAD). Traditional VAD models struggle with generalization, making them hard to apply in diverse environments, while C-VAD can handle various video environments.
  • Figure 2: The architecture of AnyAnomaly
  • Figure 3: Architecture of the proposed modules. KSM is essential for the segment-level approach, and WA and GIG are crucial for context generation.
  • Figure 4: Proposed prompt for VQA
  • Figure 5: Comparison between the VAD and C-VAD datasets
  • ...and 7 more figures