Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection
Ali Karami, Thi Kieu Khanh Ho, Narges Armanfard
TL;DR
The paper tackles skeleton-based video anomaly detection (SVAD) by addressing three core challenges: capturing spatio-temporal dependencies among joints, recognizing region-specific discrepancies in motion, and accounting for the wide variation in how a single human action can be performed. It introduces GiCiSAD, a lightweight framework with three modules: a Graph Attention-based Forecasting module, a Graph-level Jigsaw Puzzle Maker for self-supervised region-level discrimination, and a Graph-based Conditional Diffusion Model that generates diverse future motions conditioned on past frames. The method achieves state-of-the-art AUROC on four benchmark SVAD datasets while using up to 40% fewer parameters than prior unsupervised approaches. By combining dynamic graph learning, graph-level self-supervision, and diffusion-based generation, GiCiSAD detects anomalies across varied motions and body regions, which makes it a candidate for real-time surveillance applications.
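To make the idea of a graph-level jigsaw puzzle concrete, here is a minimal sketch of one possible pretext task: the feature blocks of four limb regions are permuted, and the index of the permutation becomes the self-supervised label. This is an illustration only, not the authors' implementation. The 17-joint COCO-style layout, the region split, and the names `LIMBS`, `PERMS`, and `make_puzzle` are all assumptions introduced for the example.

```python
import itertools

# Assumed region split for a 17-joint, COCO-style skeleton (hypothetical;
# the paper's actual subgraph selection may differ).
LIMBS = [
    [5, 7, 9],     # left arm: shoulder, elbow, wrist
    [6, 8, 10],    # right arm
    [11, 13, 15],  # left leg: hip, knee, ankle
    [12, 14, 16],  # right leg
]

# Every ordering of the four limbs; the index into this list is the class label.
PERMS = list(itertools.permutations(range(len(LIMBS))))


def make_puzzle(frame, perm_index):
    """Return a copy of `frame` whose limb feature blocks are permuted.

    frame: list of per-joint feature tuples, one entry per joint.
    perm_index: index into PERMS; a classifier would be trained to predict it.
    """
    perm = PERMS[perm_index]
    shuffled = list(frame)
    for slot, source in enumerate(perm):
        # Limb `slot` receives the features that belonged to limb `source`.
        for dst, src in zip(LIMBS[slot], LIMBS[source]):
            shuffled[dst] = frame[src]
    return shuffled


# Usage: one frame of dummy (x, y) coordinates, shuffled with permutation 1.
frame = [(float(j), float(j)) for j in range(17)]
puzzle = make_puzzle(frame, 1)
label = 1  # target for the puzzle-solving classifier
```

A model that can recover the permutation has to know what each region normally looks like, which is the intuition behind using such a task for region-level discrimination.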
Abstract
Skeleton-based video anomaly detection (SVAD) is a crucial task in computer vision. Accurately identifying abnormal patterns or events enables operators to promptly detect suspicious activities, thereby enhancing safety. Achieving this demands a comprehensive understanding of human motions, at both the body and region levels, while also accounting for the wide variation in how a single action can be performed. However, existing studies fail to address these properties simultaneously. This paper introduces a novel, practical, and lightweight framework, the Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection (GiCiSAD), to overcome the challenges associated with SVAD. GiCiSAD consists of three novel modules: the Graph Attention-based Forecasting module, which captures the spatio-temporal dependencies inherent in the data; the Graph-level Jigsaw Puzzle Maker module, which distinguishes subtle region-level discrepancies between normal and abnormal motions; and the Graph-based Conditional Diffusion model, which generates a wide spectrum of human motions. Extensive experiments on four widely used skeleton-based video datasets show that GiCiSAD outperforms existing methods with significantly fewer training parameters, establishing it as the new state-of-the-art.
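As a rough illustration of how a diffusion model can be conditioned on past frames, the sketch below builds one training example using the standard DDPM closed-form noising step, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. Only the future motion is noised; the past motion is passed through unchanged as the condition. The abstract does not specify the noise schedule or the conditioning mechanism, so the linear beta schedule, the flat list representation of a motion, and every function name here are assumptions.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_future(future, t, alpha_bars, rng):
    """Sample x_t from x_0 in closed form; return the noisy values and the noise."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in future]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(future, eps)]
    return noisy, eps


def training_example(past, future, alpha_bars, rng):
    """One (input, target) pair for a hypothetical conditional denoiser."""
    t = rng.randrange(len(alpha_bars))
    noisy, eps = noise_future(future, t, alpha_bars, rng)
    return {
        "noisy_future": noisy,    # what the denoiser sees
        "step": t,                # diffusion timestep
        "condition": list(past),  # clean past motion, never noised
        "target_noise": eps,      # what the denoiser is trained to predict
    }


# Usage: flattened joint coordinates stand in for past and future frames.
rng = random.Random(0)
alpha_bars = alpha_bar_schedule(100)
example = training_example([0.1, 0.2], [0.3, 0.4, 0.5], alpha_bars, rng)
```

At inference time, sampling several futures from different noise seeds for the same past would yield the kind of diverse motion hypotheses the abstract refers to; how those samples are turned into an anomaly score is not stated here and is left out of the sketch.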
