
Pseudo Anomalies Are All You Need: Diffusion-Based Generation for Weakly-Supervised Video Anomaly Detection

Satoshi Hashimoto, Hitoshi Nishimura, Yanan Wang, Mori Kurokawa

TL;DR

The paper tackles the data bottleneck in video anomaly detection by introducing PA-VAD, a generation-driven WVAD framework that trains exclusively on real normal footage and diffusion-generated pseudo-abnormal videos. It introduces CA-PAG to produce class-aware pseudo anomalies via CLIP-guided seed selection and VLM-refined prompts, and DARM to mitigate covariate shift and MIL bias through domain alignment and memory-based prototype balancing. Empirical results on ShanghaiTech and UCF-Crime show state-of-the-art performance among UVAD/WVAD baselines and competitive results against real-abnormal pipelines, highlighting substantial reductions in data-collection costs. The work demonstrates that high-accuracy anomaly detection can be achieved without collecting real anomalies, enabling scalable, practical deployment. Key contributions include the CA-PAG generator, the DARM regularization mechanism, and a thorough evaluation of pseudo-only training under weak supervision.

Abstract

Deploying video anomaly detection in practice is hampered by the scarcity and collection cost of real abnormal footage. We address this by training without any real abnormal videos while evaluating under the standard weakly supervised split, and we introduce PA-VAD, a generation-driven approach that learns a detector from synthesized pseudo-abnormal videos paired with real normal videos, using only a small set of real normal images to drive synthesis. For synthesis, we select class-relevant initial images with CLIP and refine textual prompts with a vision-language model to improve fidelity and scene consistency before invoking a video diffusion model. For training, we mitigate excessive spatiotemporal magnitude in synthesized anomalies with a domain-aligned regularized module that combines domain alignment and memory usage-aware updates. Extensive experiments show that our approach reaches 98.2% on ShanghaiTech and 82.5% on UCF-Crime, surpassing the strongest real-abnormal method on ShanghaiTech by +0.6% and outperforming the UVAD state-of-the-art on UCF-Crime by +1.9%. The results demonstrate that high-accuracy anomaly detection can be obtained without collecting real anomalies, providing a practical path toward scalable deployment.

Paper Structure

This paper contains 15 sections, 15 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Our PA-VAD framework generates class-aware pseudo-abnormal videos (e.g., Explosion, Road Accidents, Assault) from only a small set of normal images. These controllable pseudo anomalies provide scalable and diverse supervision, eliminating the need for real abnormal data—the primary bottleneck in weakly supervised VAD—and substantially expanding the range of trainable abnormal patterns.
  • Figure 2: Overview of our framework. PA-VAD trains on real Normal and diffusion-generated pseudo videos (no real Abnormal) and is evaluated on standard splits with real Normal/Abnormal.
  • Figure 3: Overview of our framework. Starting from a small set of real normal images and class texts, we synthesize pseudo-abnormal videos via an image-to-video diffusion process, then train a classifier on real normal and synthesized pseudo-abnormal videos. A Domain-Aligned Regularized Module, designed to account for the characteristic large spatiotemporal magnitude of synthesized anomalies, mitigates Multiple Instance Learning (MIL) bias and enables accurate detection.
  • Figure 4: Initial image selection in the vision–text space. We score normal images using a class text with positive phrases and subtract an aggregated negative similarity, then take the Top-$K$ per class.
  • Figure 5: Prompt refinement. A VLM extracts scene- and object-aware cues from the initial image under a class-aware instruction and produces concise abnormal descriptions, which are concatenated with template phrases before driving the diffusion model.
  • ...and 5 more figures
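
The initial image selection described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes the image and text embeddings have already been computed by a CLIP-style encoder and are passed in as plain lists of floats. The function names `cosine` and `select_top_k` are hypothetical, and averaging is only one possible reading of "aggregated negative similarity", since the caption does not specify the aggregation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_top_k(image_embs, pos_text_embs, neg_text_embs, k):
    """Rank normal images for one anomaly class and keep the Top-K.

    Score = mean similarity to the positive class phrases
            minus mean similarity to the negative phrases.
    The mean is an assumption; the caption only says "aggregated".
    """
    scores = []
    for img in image_embs:
        pos = sum(cosine(img, t) for t in pos_text_embs) / len(pos_text_embs)
        neg = sum(cosine(img, t) for t in neg_text_embs) / len(neg_text_embs)
        scores.append(pos - neg)
    # Highest score first; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy 2-D embeddings standing in for encoder outputs (illustrative only).
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positive_phrases = [[1.0, 0.0]]
negative_phrases = [[0.0, 1.0]]
top, scores = select_top_k(images, positive_phrases, negative_phrases, k=2)
```

In this toy setup the first image aligns with the positive phrase and the second with the negative one, so the selection keeps the first and third images. Running the function once per class text gives the per-class Top-$K$ seed set that is later passed to the diffusion model.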