Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model
Hang Zhou, Jiale Cai, Yuteng Ye, Yonghui Feng, Chenxing Gao, Junqing Yu, Zikai Song, Wei Yang
TL;DR
The paper tackles video anomaly detection (VAD) by reframing it as future frame prediction with a patch-based diffusion model conditioned on appearance and motion. The proposed model, MA-PDM, combines a memory-augmented appearance encoder with a patch-wise diffusion process to capture fine-grained local anomalies, and uses temporal differences between frames as motion cues. Experiments on four benchmarks show that it outperforms most existing methods, and ablations quantify the contributions of patch-based diffusion, appearance/motion conditioning, and the patch memory. The approach improves detection of small, localized anomalies and demonstrates the practicality of diffusion-based frame prediction for VAD, with potential for faster and richer conditioning in future work.
Abstract
A recent line of work in one-class video anomaly detection leverages diffusion models and poses the task as a generation problem: the diffusion model is trained to recover normal patterns exclusively, so abnormal patterns are reported as outliers. Yet existing attempts neglect the varied forms that anomalies take and predict normal samples at the feature level, even though abnormal objects in surveillance videos are often relatively small. To address this, we propose a novel patch-based diffusion model specifically engineered to capture fine-grained local information. We further observe that anomalies in videos manifest as deviations in both appearance and motion, and we therefore argue that a comprehensive solution must consider both aspects simultaneously to achieve accurate frame prediction. To this end, we introduce motion and appearance conditions that are integrated into our patch diffusion model. These conditions guide the model toward coherent and contextually appropriate predictions of both semantic content and motion relations. Experimental results on four challenging video anomaly detection datasets substantiate the efficacy of the proposed approach, showing that it consistently outperforms most existing methods in detecting abnormal behaviors.
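To see why patch-level reasoning can matter for small anomalies, here is a minimal sketch in plain Python. It is not the paper's implementation. The diffusion model itself is omitted and `pred` merely stands in for a predicted frame. The function names, the non-overlapping tiling, the mean squared error, and the max-over-patches frame score are all illustrative assumptions, since this section does not specify the scoring rule.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel values in [0, 1]


def temporal_difference(prev: Frame, curr: Frame) -> Frame:
    """Motion cue (assumed form): per-pixel difference of consecutive frames."""
    return [[c - p for p, c in zip(p_row, c_row)]
            for p_row, c_row in zip(prev, curr)]


def patch_errors(pred: Frame, target: Frame, patch: int) -> List[float]:
    """Mean squared error of each non-overlapping patch x patch tile."""
    height, width = len(target), len(target[0])
    errors = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            squared = [
                (pred[y][x] - target[y][x]) ** 2
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ]
            errors.append(sum(squared) / len(squared))
    return errors


def anomaly_score(pred: Frame, target: Frame, patch: int = 2) -> float:
    """Frame score (assumed rule): the worst patch error in the frame."""
    return max(patch_errors(pred, target, patch))


# Toy example: the prediction is an empty 4x4 frame, but the observed
# frame contains a single bright pixel (a small, localized deviation).
pred = [[0.0] * 4 for _ in range(4)]
target = [[0.0] * 4 for _ in range(4)]
target[0][0] = 1.0

patch_score = anomaly_score(pred, target, patch=2)  # error in the worst 2x2 tile
frame_score = anomaly_score(pred, target, patch=4)  # error averaged over the frame
motion = temporal_difference(pred, target)          # nonzero only at the changed pixel
```

In this toy case the single deviating pixel yields a patch-level error of 0.25 but a whole-frame error of only 0.0625, which illustrates how frame-level averaging can dilute a small anomaly that a patch-wise view keeps visible.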
