Language-guided Open-world Video Anomaly Detection under Weak Supervision

Zihao Liu, Xiaoyu Wu, Jianqin Wu, Xuxu Wang, Linlin Yang

TL;DR

This work addresses open-world video anomaly detection (VAD), where the definition of an anomaly can change, by modeling the anomaly definition as a stochastic variable Z and predicting the output Y from the joint input (V, Z) of video and definition. It introduces LaGoVAD, a multimodal architecture that fuses video with textual anomaly definitions and is regularized by dynamic video synthesis and contrastive learning with negative mining to mitigate overfitting in the high-diversity joint video-text input space. To support training and evaluation, the authors construct PreVAD, a large-scale dataset with anomaly descriptions and a multi-level taxonomy, which enables weakly supervised learning over diverse definitions and zero-shot cross-dataset testing. Across seven datasets and two evaluation protocols, LaGoVAD achieves state-of-the-art zero-shot performance and remains robust under concept drift, which suggests that language-guided open-world VAD is practical for flexible, user-driven surveillance analysis. The work also reports ablations and qualitative analyses that point to the value of semantic alignment and duration-diverse data synthesis.

Abstract

Video anomaly detection (VAD) aims to detect anomalies that deviate from what is expected. In open-world scenarios, the expected events may change as requirements change. For example, not wearing a mask may be considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time. This paradigm necessitates establishing a robust mapping from video and textual definition to anomaly scores. Therefore, we propose LaGoVAD (Language-guided Open-world Video Anomaly Detector), a model that dynamically adapts anomaly definitions under weak supervision with two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide labels without semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate LaGoVAD's SOTA performance. Our dataset and code will be released at https://github.com/Kamino666/LaGoVAD-PreVAD.
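The abstract mentions diversifying the relative durations of anomalies via dynamic video synthesis. The following is a minimal sketch of that general idea, not the authors' implementation: it splices one anomalous clip between a variable number of normal clips, so the fraction of anomalous frames changes from sample to sample. The function names `synthesize_video` and `anomaly_ratio`, the list-of-frames representation, and the returned per-frame labels are all assumptions made for illustration.

```python
import random


def synthesize_video(anomaly_clip, normal_clips, num_normal, rng):
    """Splice one anomalous clip into a random slot among sampled normal clips.

    Hypothetical helper: clips are plain lists of frames, and the returned
    0/1 labels mark which frames came from the anomalous clip.
    """
    # Sample the normal clips that will surround the anomaly.
    chosen = [rng.choice(normal_clips) for _ in range(num_normal)]
    segments = [(clip, 0) for clip in chosen]
    # randint is inclusive on both ends, so the anomaly may go first or last.
    position = rng.randint(0, num_normal)
    segments.insert(position, (anomaly_clip, 1))

    frames, labels = [], []
    for clip, label in segments:
        frames.extend(clip)
        labels.extend([label] * len(clip))
    return frames, labels


def anomaly_ratio(labels):
    """Fraction of frames that belong to the anomalous clip."""
    return sum(labels) / len(labels)


if __name__ == "__main__":
    rng = random.Random(0)
    anomaly = ["a"] * 10                   # stand-in for 10 anomalous frames
    normals = [["n"] * 20, ["n"] * 30]     # stand-ins for normal clips
    for k in (0, 1, 3):
        _, labels = synthesize_video(anomaly, normals, k, rng)
        print(k, round(anomaly_ratio(labels), 3))
```

Varying `num_normal` per training sample is what changes the relative duration here; how the paper samples clips and how the synthesized videos enter the loss are not specified in this section.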

Paper Structure

This paper contains 48 sections, 13 equations, 14 figures, 11 tables, 1 algorithm.

Figures (14)

  • Figure 1: Comparison of different VAD paradigms. Closed-set methods (b) can only detect anomalies in the training scope, while open-set methods (c) can detect novel anomalies. Our open-world approach (d) can deal with label change in open-world scenarios, with an example in (e).
  • Figure 2: Architecture of our proposed LaGoVAD, which implements Eq. \ref{eq:ours} by adding an anomaly definition branch ($z\rightarrow \mathcal{G} \rightarrow \mathcal{U}$). The model is trained with two novel regularization strategies: dynamic video synthesis $\mathcal{L}_{\text{dvs}}$ (Sec. \ref{sec:dys-module}) and a contrastive learning loss with negative mining $\mathcal{L}_{\text{neg}}$ (Sec. \ref{sec:neg-module}).
  • Figure 2: Comparison in temporal binary anomaly detection under Protocol 1. Results marked with $\dagger$ are taken from their publications and results marked with $\ddagger$ are from LAVAD.
  • Figure 3: The statistics, comparisons and a data sample of the proposed PreVAD.
  • Figure 4: Visualization of different methods under concept drift. Knocking over a trashcan is considered normal in (a) but abnormal in (b). All models are prompted with the corresponding definition.
  • ...and 9 more figures