AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis

Zhiwei Yang, Chen Gao, Jing Liu, Peng Wu, Guansong Pang, Mike Zheng Shou

TL;DR

AssistPDA is an online framework for Video Anomaly Prediction, Detection, and Analysis (VAPDA) that unifies real-time anomaly forecasting, detection, and interactive analysis. It couples a vision encoder, a lightweight Spatio-Temporal Relation Distillation (STRD) module, and an LLM fine-tuned with LoRA to translate streaming visual context into timely natural-language responses while maintaining long-range temporal awareness. A new benchmark, VAPDA-127K, supports training and evaluation of online VAPDA tasks, including event-level prediction and open-ended anomaly analysis. Reported results show state-of-the-art performance for real-time VAPDA across prediction, detection, and analysis, with streaming speeds of 15–20 FPS on standard GPUs. The authors plan to open-source the data and code to foster further research.

Abstract

Rapid advances in large language models (LLMs) have spurred growing interest in LLM-based video anomaly detection (VAD). However, existing approaches predominantly focus on video-level anomaly question answering or offline detection, ignoring the real-time nature essential for practical VAD applications. To bridge this gap and facilitate the practical deployment of LLM-based VAD, we introduce AssistPDA, the first online video anomaly surveillance assistant that unifies video anomaly prediction, detection, and analysis (VAPDA) within a single framework. AssistPDA enables real-time inference on streaming videos while supporting interactive user engagement. Notably, we introduce a novel event-level anomaly prediction task, enabling proactive anomaly forecasting before anomalies fully unfold. To better model the intricate spatiotemporal relationships in anomaly events, we propose a Spatio-Temporal Relation Distillation (STRD) module. STRD transfers the long-term spatiotemporal modeling capabilities of vision-language models (VLMs) from offline settings to real-time scenarios, thereby equipping AssistPDA with a robust understanding of complex temporal dependencies and long-sequence memory. Additionally, we construct VAPDA-127K, the first large-scale benchmark designed for VLM-based online VAPDA. Extensive experiments demonstrate that AssistPDA outperforms existing offline VLM-based approaches, setting a new state of the art for real-time VAPDA. Our dataset and code will be open-sourced to facilitate further research in the community.
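
To make the "online" constraint concrete, the following is a minimal sketch of a streaming loop, not the authors' implementation. It only illustrates the general idea that each decision is made as a frame arrives, using a bounded memory of past frame features and no future frames. The names `stream_anomaly_scores`, `encode`, `score`, `memory_size`, and the toy scorer `mean_abs_change` are all hypothetical stand-ins; in AssistPDA the corresponding roles are played by the vision encoder, the STRD module, and the LLM, whose details are not given in this summary.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

Feature = List[float]


def stream_anomaly_scores(
    frames: Iterable[Feature],
    encode: Callable[[Feature], Feature],
    score: Callable[[List[Feature]], float],
    memory_size: int = 8,
) -> Iterator[Tuple[int, float]]:
    """Yield (frame_index, score) as each frame arrives.

    Only the current frame and a bounded window of past features are
    visible to `score`, which is what distinguishes an online setting
    from offline, whole-video processing.
    """
    # Hypothetical bounded temporal memory; old features are evicted
    # automatically once `memory_size` entries are stored.
    memory: Deque[Feature] = deque(maxlen=memory_size)
    for index, frame in enumerate(frames):
        memory.append(encode(frame))
        yield index, score(list(memory))


def mean_abs_change(context: List[Feature]) -> float:
    """Toy stand-in scorer: mean absolute change between the last two features."""
    if len(context) < 2:
        return 0.0
    prev, curr = context[-2], context[-1]
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
```

Called on a toy stream such as `[[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]` with an identity `encode`, the generator emits one score per frame, and the score rises only at the frame where the features change. A real system would replace the toy scorer with a learned model and could trigger a natural-language response when the score crosses a threshold.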

Paper Structure

This paper contains 19 sections, 6 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Illustration of the proposed Video Anomaly Prediction, Detection, and Analysis (VAPDA) tasks.
  • Figure 2: Pipeline of data construction for the proposed VAPDA-127K dataset.
  • Figure 3: Pipeline of our method. VE and STRD are short for Video Encoder and Spatio-Temporal Relation Distillation, respectively.
  • Figure 4: Illustration of the STRD module.
  • Figure 5: Visualization results on the test set.
  • ...and 1 more figure