Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight

Xi Ding, Lei Wang

TL;DR

This paper surveys recent progress in video anomaly detection using large language and vision-language models (LLMs and VLMs). It identifies four focal areas: enhancing interpretability via semantic explanations, modeling long-range temporal dependencies, enabling training-free and few-shot detection, and addressing open-world/class-agnostic anomalies. It provides a comparative analysis of 2024 methods, discusses their advantages and limitations, and proposes hybrid, multi-faceted approaches to improve robustness and scalability. The work underscores the potential of multimodal reasoning to redefine VAD and outlines directions for future research.

Abstract

Video anomaly detection (VAD) has witnessed significant advancements through the integration of large language models (LLMs) and vision-language models (VLMs), addressing critical challenges such as interpretability, temporal reasoning, and generalization in dynamic, open-world scenarios. This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024, focusing on four key aspects: (i) enhancing interpretability through semantic insights and textual explanations, making visual anomalies more understandable; (ii) capturing intricate temporal relationships to detect and localize dynamic anomalies across video frames; (iii) enabling few-shot and zero-shot detection to minimize reliance on large, annotated datasets; and (iv) addressing open-world and class-agnostic anomalies by using semantic understanding and motion features for spatiotemporal coherence. We highlight the potential of these methods to redefine the landscape of VAD. Additionally, we explore the synergy between visual and textual modalities offered by LLMs and VLMs, highlighting their combined strengths and proposing future directions to fully exploit that potential for video anomaly detection.

Paper Structure

This paper contains 9 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: We present a systematic evaluation of 13 closely related works from 2024 that use large language models (LLMs) and vision-language models (VLMs) for video anomaly detection (VAD). The analysis is organized around four key perspectives: (a) temporal modeling, (b) interpretability, (c) training-free detection, and (d) open-world detection, each represented by a subfigure. For each perspective, we highlight the strategies used, their key strengths and limitations, and promising directions for future research. The video frames used in the analysis are sourced from the MSAD dataset (Zhu et al.).
  • Figure 2: Various sampling strategies.