Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model

Hang Zhou, Jiale Cai, Yuteng Ye, Yonghui Feng, Chenxing Gao, Junqing Yu, Zikai Song, Wei Yang

TL;DR

The paper frames video anomaly detection (VAD) as future frame prediction with a patch-based diffusion model conditioned on appearance and motion. The proposed model, MA-PDM, combines a memory-augmented appearance encoder with a patch-wise diffusion process to capture fine-grained local anomalies, and uses temporal differences between frames as motion cues. Experiments on four benchmarks show that it outperforms most existing methods, and ablations quantify the contributions of patch-based diffusion, appearance and motion conditioning, and the patch memory. The approach improves detection of small, localized anomalies and shows that diffusion-based frame prediction is practical for VAD, with faster and richer conditioning left as future work.

Abstract

A recent direction in one-class video anomaly detection is to leverage diffusion models and pose the task as a generation problem: the diffusion model is trained to recover normal patterns exclusively, so abnormal patterns are reported as outliers. Yet existing attempts neglect the varied forms that anomalies take and predict normal samples at the feature level, even though abnormal objects in surveillance videos are often relatively small. To address this, a novel patch-based diffusion model is proposed, specifically engineered to capture fine-grained local information. We further observe that anomalies in videos manifest as deviations in both appearance and motion, and therefore argue that a comprehensive solution must consider both aspects simultaneously to achieve accurate frame prediction. To this end, we introduce motion and appearance conditions that are integrated into our patch diffusion model. These conditions guide the model toward coherent and contextually appropriate predictions for both semantic content and motion relations. Experimental results on four challenging video anomaly detection datasets substantiate the efficacy of the proposed approach, demonstrating that it consistently outperforms most existing methods in detecting abnormal behaviors.
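
As a rough illustration of the motion cue mentioned above, the sketch below computes temporal differences between consecutive frames. It is a minimal, dependency-free example and not the authors' implementation: the grayscale nested-list frame representation and the names `temporal_difference` and `motion_conditions` are assumptions made here for illustration, and a real system would operate on image tensors.

```python
from typing import List

# Assumption: a frame is a grayscale image stored as rows of pixel intensities.
Frame = List[List[float]]


def temporal_difference(prev: Frame, curr: Frame) -> Frame:
    """Pixel-wise difference curr - prev, used here as a simple motion cue."""
    return [
        [c - p for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def motion_conditions(frames: List[Frame]) -> List[Frame]:
    """Differences between every pair of consecutive frames (length T - 1)."""
    return [temporal_difference(a, b) for a, b in zip(frames, frames[1:])]


# Toy clip: a single bright pixel moves from the top-right to the bottom-right.
frames = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
diffs = motion_conditions(frames)
```

Static background pixels produce zeros in the difference maps, so only moving content contributes to the motion condition in this sketch.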

Paper Structure

This paper contains 17 sections, 16 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: We propose a patch-based diffusion model with motion and appearance conditions for VAD. The conditional frames, which carry motion information (temporal differences) and appearance information, are cropped into patches and used to estimate noise. The patch-level noise estimates are then combined and refined into frame-level noise. The DDIM reverse step is then applied to recover the clean image.
  • Figure 2: Our MA-PDM comprises three components: a patch cropping module that creates the patch conditions and patch noise, an appearance encoder that embeds and retains regular patterns, and a noise estimation network that predicts the noise. During training, the MA-PDM is trained to predict the forward noise. During inference, the MA-PDM predicts the patch noise from the conditions and combines the patch estimates in the reverse process.
  • Figure 3: Four examples comparing anomaly detection results on the Ped2 and ShanghaiTech datasets.
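
The inference pipeline described in the Figure 1 caption (crop patches, estimate noise per patch, combine into frame-level noise, apply a DDIM reverse step) can be sketched roughly as follows. This is a hedged, dependency-free illustration rather than the paper's method: averaging overlapping patches is an assumed way of combining the estimates, `toy_estimator` is a hypothetical stand-in for the trained noise estimation network, and the patch size, stride, and alpha values are arbitrary. The `ddim_step` function uses the standard deterministic DDIM update (eta = 0).

```python
import math
from typing import List, Tuple

Grid = List[List[float]]  # assumption: single-channel frame as nested lists


def crop(grid: Grid, top: int, left: int, size: int) -> Grid:
    """Square patch of side `size` whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in grid[top:top + size]]


def patch_origins(height: int, width: int, size: int, stride: int) -> List[Tuple[int, int]]:
    """Top-left corners of a sliding-window patch grid."""
    return [
        (t, l)
        for t in range(0, height - size + 1, stride)
        for l in range(0, width - size + 1, stride)
    ]


def merge_patch_noise(patches: List[Grid], origins: List[Tuple[int, int]],
                      height: int, width: int, size: int) -> Grid:
    """Assumed combination rule: average the estimates where patches overlap."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for patch, (t, l) in zip(patches, origins):
        for i in range(size):
            for j in range(size):
                total[t + i][l + j] += patch[i][j]
                count[t + i][l + j] += 1
    return [
        [total[i][j] / count[i][j] if count[i][j] else 0.0 for j in range(width)]
        for i in range(height)
    ]


def ddim_step(x_t: Grid, eps: Grid, alpha_t: float, alpha_prev: float) -> Grid:
    """Deterministic DDIM update; alpha_* are cumulative products (alpha-bar)."""
    out = []
    for x_row, e_row in zip(x_t, eps):
        row = []
        for x, e in zip(x_row, e_row):
            x0 = (x - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
            row.append(math.sqrt(alpha_prev) * x0 + math.sqrt(1.0 - alpha_prev) * e)
        out.append(row)
    return out


def toy_estimator(noisy_patch: Grid, cond_patch: Grid) -> Grid:
    """Hypothetical placeholder for the noise network; ignores the condition."""
    return noisy_patch


H = W = 4
SIZE, STRIDE = 2, 1
x_t = [[float(i * W + j) for j in range(W)] for i in range(H)]  # noisy frame
cond = [[0.0] * W for _ in range(H)]                            # condition frame
origins = patch_origins(H, W, SIZE, STRIDE)
patch_eps = [
    toy_estimator(crop(x_t, t, l, SIZE), crop(cond, t, l, SIZE))
    for t, l in origins
]
frame_eps = merge_patch_noise(patch_eps, origins, H, W, SIZE)
x_prev = ddim_step(x_t, frame_eps, alpha_t=0.5, alpha_prev=0.9)
```

With overlapping patches, each pixel receives several noise estimates, and averaging them is one simple way to avoid seams at patch borders; the caption's "refined" step may involve more than this.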