No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection

Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao

TL;DR

LAVIDA is an end-to-end zero-shot video anomaly detection framework trained solely on pseudo-anomalies. It uses a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and to decrease computational cost. Evaluations across four benchmark VAD datasets show that it achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting.

Abstract

Collecting and detecting video anomaly data has long been a challenging problem because anomalies occur rarely and are sparse in both space and time. Existing video anomaly detection (VAD) methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and an inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories, and it integrates a Multimodal Large Language Model (MLLM) to strengthen semantic comprehension. iii) We design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and to decrease computational cost. Training is conducted solely on pseudo-anomalies, without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available at https://github.com/VitaminCreed/LAVIDA.

Paper Structure (25 sections, 10 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Motivation. Left: Existing VAD methods rely on training with anomaly data from single scenarios, resulting in poor generalization capability to novel anomaly types or unseen scenarios. Right: Our LAVIDA model leverages MLLM to understand deep anomaly semantics, enabling generalization to arbitrary anomaly types across diverse scenarios. The training data consists of pseudo anomaly data synthesized from external datasets, without incorporating any VAD data.
  • Figure 2: Overview of LAVIDA Framework. LAVIDA is trained solely on a comprehensive Anomaly Exposure dataset, and consists of five key components: an MLLM, a text encoder, a vision backbone, a SAM2 mask decoder, and a Multi-scale Semantic Projector.
  • Figure 3: Anomaly Exposure Sampler. We sample irrelevant categories from other samples to create anomaly categories, and randomly designate each sample as anomalous or normal according to a probability.
  • Figure 4: Qualitative Results for Anomaly Detection. For each case, the first row presents pixel-level detection results, which are masked in green. The second row displays frame-level anomaly scores, with temporal intervals of anomalous events marked in pink.
  • Figure 5: Quantitative Visualizations for Open-World Scenarios. The left panel shows the original image, and the right panel highlights detected anomalies with green masks.
  • ...and 2 more figures
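
The sampling idea described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `anomaly_exposure_sample`, the parameters `p_anomalous` and `n_distractors`, and the record layout are all hypothetical. The sketch also assumes one particular reading of the caption: an "anomalous" sample has one of its own segmented object categories added to the anomaly category list, while a "normal" sample is paired only with irrelevant categories drawn from other samples.

```python
import random


def anomaly_exposure_sample(samples, p_anomalous=0.5, n_distractors=2, seed=0):
    """Build pseudo-anomaly training records from segmentation-style samples.

    Each input sample is a dict with an "id" and a list of object "categories".
    All names and the labeling rule are illustrative assumptions.
    """
    rng = random.Random(seed)
    records = []
    for i, sample in enumerate(samples):
        own = set(sample["categories"])
        # Irrelevant categories: present in other samples, absent from this one.
        pool = sorted(
            {c for j, other in enumerate(samples) if j != i
             for c in other["categories"]} - own
        )
        distractors = rng.sample(pool, min(n_distractors, len(pool)))
        # Randomly designate the sample as anomalous or normal.
        is_anomalous = bool(own) and rng.random() < p_anomalous
        if is_anomalous:
            # One of the sample's own objects plays the role of the anomaly.
            target = rng.choice(sorted(own))
            anomaly_categories = distractors + [target]
            rng.shuffle(anomaly_categories)
        else:
            # No listed category appears in the sample, so nothing is anomalous.
            target = None
            anomaly_categories = distractors
        records.append({
            "id": sample["id"],
            "anomaly_categories": anomaly_categories,
            "label": int(is_anomalous),
            "target": target,
        })
    return records


samples = [
    {"id": "a", "categories": ["person", "bicycle"]},
    {"id": "b", "categories": ["car"]},
    {"id": "c", "categories": ["dog", "person"]},
]
for record in anomaly_exposure_sample(samples):
    print(record)
```

Under this reading, the segmentation mask of `target` would serve as the pixel-level ground truth for an anomalous record, and a normal record would have an empty mask. How the actual sampler builds its labels and masks should be checked against the paper and the released code.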