Language-guided Open-world Video Anomaly Detection under Weak Supervision
Zihao Liu, Xiaoyu Wu, Jianqin Wu, Xuxu Wang, Linlin Yang
TL;DR
This work addresses open-world video anomaly detection under shifting anomaly definitions by modeling the anomaly definition as a stochastic variable Z and predicting the anomaly label Y from the joint input (V, Z), where V is the video, using a language-guided paradigm. It introduces LaGoVAD, a multimodal architecture that fuses video with textual anomaly definitions and uses dynamic video synthesis and contrastive learning with hard negative mining to mitigate overfitting in the larger, more diverse joint video-text input space. For training and evaluation, the authors construct PreVAD, a large-scale dataset with rich anomaly descriptions and a multi-level taxonomy, which supports diverse weakly supervised learning and zero-shot cross-dataset testing. Across seven datasets and two evaluation protocols, LaGoVAD achieves state-of-the-art zero-shot performance and remains robust under concept drift, which points to the practical potential of language-guided open-world VAD for flexible, user-driven surveillance analysis. The work also provides ablations and qualitative analyses that show the value of semantic alignment and duration-diverse data synthesis for open-world anomaly detection.
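The difference between the two paradigms can be written compactly. The block below is a sketch of one reading of the notation, not the authors' formal statement: it takes V as the video, Z as the user-supplied anomaly definition, Y as the anomaly label, and f_theta as a hypothetical name for the detector.

```latex
% Fixed-definition VAD: the definition is implicit in the training data.
f_\theta : V \mapsto Y

% Language-guided VAD: the definition is an explicit input.
f_\theta : (V, Z) \mapsto Y

% Law of total probability over definitions z:
P(Y \mid V) = \sum_{z} P(Y \mid V, z)\, P(z \mid V)
```

Under this reading, a change in which definitions are in force changes P(Y | V) even when P(Y | V, Z) stays fixed, which is one way to see why a model conditioned on Z could be less sensitive to concept drift than one that learns P(Y | V) directly.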
Abstract
Video anomaly detection (VAD) aims to detect events that deviate from what is expected. In open-world scenarios, what is expected may change as requirements change. For example, not wearing a mask may be considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is fixed, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, which allows detection to be guided by user-provided natural language at inference time. This paradigm requires a robust mapping from a video and a textual definition to anomaly scores. Therefore, we propose LaGoVAD (Language-guided Open-world Video Anomaly Detector), a model that dynamically adapts to anomaly definitions under weak supervision with two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide labels without semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate LaGoVAD's state-of-the-art performance. Our dataset and code will be released at https://github.com/Kamino666/LaGoVAD-PreVAD.
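To make the first regularization strategy more concrete, here is a minimal sketch of how the relative duration of an anomaly could be varied by splicing normal clips around an anomalous one. It is an illustration under stated assumptions, not the released implementation: the function name `synthesize_video`, the `max_normals` parameter, and the use of pre-extracted per-snippet feature arrays of shape (T, D) are all hypothetical.

```python
import numpy as np


def synthesize_video(anomalous, normals, rng, max_normals=3):
    """Splice one anomalous clip with a random number of normal clips.

    anomalous: array of shape (T_a, D), per-snippet features of an anomalous clip.
    normals:   list of arrays of shape (T_i, D), features of normal clips.
    Returns (features, video_label, anomaly_ratio). The video-level label
    stays 1 because the result still contains the anomaly (weak supervision).
    """
    # Choose how many normal clips to add (possibly zero).
    k = min(int(rng.integers(0, max_normals + 1)), len(normals))
    picks = []
    if k > 0:
        idx = rng.choice(len(normals), size=k, replace=False)
        picks = [normals[i] for i in idx]

    # Choose where the anomalous clip sits among the normal clips.
    pos = int(rng.integers(0, k + 1))
    segments = picks[:pos] + [anomalous] + picks[pos:]

    features = np.concatenate(segments, axis=0)
    anomaly_ratio = anomalous.shape[0] / features.shape[0]
    return features, 1, anomaly_ratio


rng = np.random.default_rng(0)
anomalous = np.ones((8, 4))                      # toy anomalous clip
normals = [np.zeros((16, 4)), np.zeros((32, 4))]  # toy normal clips
feats, label, ratio = synthesize_video(anomalous, normals, rng)
print(feats.shape, label, round(ratio, 3))
```

Because the number and placement of normal clips are resampled on every call, the same anomalous clip appears with different relative durations across training iterations, which is the effect the abstract attributes to dynamic video synthesis.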
