AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis
Zhiwei Yang, Chen Gao, Jing Liu, Peng Wu, Guansong Pang, Mike Zheng Shou
TL;DR
AssistPDA is an online framework for Video Anomaly Prediction, Detection, and Analysis (VAPDA) that unifies real-time anomaly forecasting, detection, and interactive analysis. It couples a vision encoder, a lightweight Spatio-Temporal Relation Distillation (STRD) module, and an LLM fine-tuned with LoRA to translate streaming visual context into timely natural-language responses while maintaining long-range temporal awareness. A new benchmark, VAPDA-127K, supports training and evaluation of online VAPDA tasks, including event-level prediction and open-ended anomaly analysis. Empirical results show state-of-the-art performance for real-time VAPDA across prediction, detection, and analysis, with practical streaming speeds of 15–20 FPS on standard GPUs. The data and code will be open-sourced to foster further research.
Abstract
Rapid advances in large language models (LLMs) have spurred growing interest in LLM-based video anomaly detection (VAD). However, existing approaches predominantly focus on video-level anomaly question answering or offline detection, overlooking the real-time operation that practical VAD applications require. To bridge this gap and facilitate the practical deployment of LLM-based VAD, we introduce AssistPDA, the first online video anomaly surveillance assistant that unifies video anomaly prediction, detection, and analysis (VAPDA) within a single framework. AssistPDA enables real-time inference on streaming videos while supporting interactive user engagement. Notably, we introduce a novel event-level anomaly prediction task, enabling proactive anomaly forecasting before anomalies fully unfold. To better model the intricate spatiotemporal relationships in anomaly events, we propose a Spatio-Temporal Relation Distillation (STRD) module. STRD transfers the long-term spatiotemporal modeling capabilities of vision-language models (VLMs) from offline settings to real-time scenarios, equipping AssistPDA with a robust understanding of complex temporal dependencies and long-sequence memory. Additionally, we construct VAPDA-127K, the first large-scale benchmark designed for VLM-based online VAPDA. Extensive experiments demonstrate that AssistPDA outperforms existing offline VLM-based approaches, setting a new state of the art for real-time VAPDA. Our dataset and code will be open-sourced to facilitate further research in the community.
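
The offline-to-online transfer described for STRD can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the STRD architecture or its loss, so everything here is an assumption made for illustration. The "teacher" is a stand-in that summarizes each frame with access to the whole clip (the offline setting), the "student" sees one frame at a time and keeps only a bounded memory of past frames (the online setting), and a mean-squared error between the two outputs plays the role of a distillation signal. The names `teacher_features`, `StreamingStudent`, `distillation_loss`, and `run_stream` are hypothetical.

```python
from collections import deque
from typing import Deque, List, Sequence

Vector = List[float]


def mean_pool(features: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def teacher_features(clip: Sequence[Vector]) -> List[Vector]:
    """Offline stand-in (assumption): each frame is mixed with whole-clip context."""
    context = mean_pool(clip)
    return [[0.5 * f + 0.5 * c for f, c in zip(frame, context)] for frame in clip]


class StreamingStudent:
    """Online stand-in (assumption): one frame per step, bounded memory of the past."""

    def __init__(self, memory_size: int) -> None:
        # deque(maxlen=...) silently drops the oldest frame once the memory is full.
        self.memory: Deque[Vector] = deque(maxlen=memory_size)

    def step(self, frame: Vector) -> Vector:
        self.memory.append(frame)
        context = mean_pool(self.memory)  # only frames seen so far
        return [0.5 * f + 0.5 * c for f, c in zip(frame, context)]


def distillation_loss(student_out: Sequence[Vector], teacher_out: Sequence[Vector]) -> float:
    """Mean squared error over all frames and feature dimensions."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_out, teacher_out):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def run_stream(clip: Sequence[Vector], memory_size: int) -> float:
    """Feed the clip frame by frame to the student and compare against the teacher."""
    student = StreamingStudent(memory_size)
    student_out = [student.step(frame) for frame in clip]
    return distillation_loss(student_out, teacher_features(clip))
```

In this toy setup the student cannot match the teacher on early frames, because the teacher's context includes frames the student has not yet seen. The gap shrinks as the student's memory approaches the full clip. A trained module would learn parameters to reduce such a loss, whereas the stand-ins above are fixed functions.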
