
Video Anomaly Detection in 10 Years: A Survey and Outlook

Moshira Abdalla, Sajid Javed, Muaz Al Radi, Anwaar Ulhaq, Naoufel Werghi

TL;DR

This survey addresses the problem of detecting anomalies in video data by consolidating deep learning approaches across supervised, weakly supervised, self-supervised, and unsupervised paradigms. It foregrounds the role of vision-language models (VLMs) and cross-modal features, detailing a taxonomy that interrelates learning schemes with feature extractors, and it systematically evaluates state-of-the-art methods on major benchmarks like UCF-Crime, ShanghaiTech, and XD-Violence. Key contributions include a comprehensive dataset and feature-type analysis, a critical comparison of SOTA models, and forward-looking guidance on loss functions, regularization, and multi-modal learning to improve real-world robustness. The findings underscore a shift toward VLM-augmented, transformer-based approaches that leverage textual context to enhance anomaly understanding, with broader implications for scalable, privacy-conscious, and interpretable VAD systems in surveillance and related domains.

Abstract

Video anomaly detection (VAD) holds immense importance across diverse domains such as surveillance, healthcare, and environmental monitoring. While numerous surveys focus on conventional VAD methods, they often lack depth in exploring specific approaches and emerging trends. This survey explores deep learning-based VAD, expanding beyond traditional supervised training paradigms to encompass emerging weakly supervised, self-supervised, and unsupervised approaches. A prominent feature of this review is the investigation of core challenges within the VAD paradigms, including large-scale datasets, feature extraction, learning methods, loss functions, regularization, and anomaly score prediction. This review also investigates vision-language models (VLMs) as potent feature extractors for VAD. VLMs integrate visual data with textual descriptions or spoken language from videos, enabling a nuanced understanding of scenes that is crucial for anomaly detection. By addressing these challenges and proposing future research directions, this review aims to foster the development of robust and efficient VAD systems that leverage the capabilities of VLMs for enhanced anomaly detection in complex real-world scenarios. This comprehensive analysis seeks to bridge existing knowledge gaps, provide researchers with valuable insights, and contribute to shaping the future of VAD research.

Paper Structure (57 sections, 15 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Video anomaly detection paradigm including (A) state-of-the-art dataset building and selection, (B) spatial, temporal, spatio-temporal, and textual deep feature extraction, (C) diverse deep learning and supervision schemes (supervised, self-supervised, weakly supervised, and unsupervised methods), (D) selection of loss functions, (E) integration of regularization techniques within loss functions, (F) anomaly score calculation, and (G) model evaluation techniques.
  • Figure 2: Taxonomy of Video Anomaly Detection.
  • Figure 3: Sample frames showcasing the diversity of scenes and anomalies present in publicly available datasets used for Video Anomaly Detection. These frames offer a glimpse into the range of challenges and scenarios addressed within the field, providing valuable insights for testing and benchmarking anomaly detection models.
  • Figure 4: A qualitative comparison and illustration of correctly and incorrectly classified frames using four VAD models, namely Sultani et al. [sultani2018real], GCN [zaheer2022generative], CLAV [cho2023look], and VAD-CLIP [wu2023vadclip].
  • Figure 5: Bibliometric network visualization for thematic analysis of recent literature (the 50 most-cited papers) on Video Anomaly Detection published in 2023-2024.