Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach

Ayush K. Rai, Tarun Krishna, Feiyan Hu, Alexandru Drimbarean, Kevin McGuinness, Alan F. Smeaton, Noel E. O'Connor

TL;DR

This work tackles video anomaly detection under open-set conditions by generating generic spatio-temporal pseudo-anomalies (PAs) without dataset-specific priors. It uses a pre-trained Latent Diffusion Model to create spatial PAs via inpainting and applies mixup to optical-flow patches to create temporal PAs; a ViFi-CLIP-based semantic discriminator captures semantic inconsistency. A unified one-class classification (OCC) framework jointly estimates reconstruction quality, temporal irregularity, and semantic inconsistency through two 3D-CNN autoencoders and the semantic discriminator, and aggregates the three indicators into a single anomaly score. Experiments on Ped2, Avenue, ShanghaiTech, and UBnormal show performance competitive with state-of-the-art methods and provide evidence that PAs transfer across datasets, which points to the robustness and generalization of the approach.

Abstract

Video Anomaly Detection (VAD) is an open-set recognition task, usually formulated as a one-class classification (OCC) problem in which the training data consists of videos with normal instances while the test data contains both normal and anomalous instances. Recent works have investigated the creation of pseudo-anomalies (PAs) using only the normal data, making strong assumptions about real-world anomalies with regard to the abnormality of objects and the speed of motion in order to inject prior information about anomalies into an autoencoder (AE) based reconstruction model during training. This work proposes a novel method for generating generic spatio-temporal PAs by inpainting a masked-out region of an image using a pre-trained Latent Diffusion Model and further perturbing the optical flow using mixup to emulate spatio-temporal distortions in the data. In addition, we present a simple unified framework to detect real-world anomalies under the OCC setting by learning three types of anomaly indicators, namely reconstruction quality, temporal irregularity and semantic inconsistency. Extensive experiments on four VAD benchmark datasets, namely Ped2, Avenue, ShanghaiTech and UBnormal, demonstrate that our method performs on par with other existing state-of-the-art PA generation and reconstruction-based methods under the OCC setting. Our analysis also examines the transferability and generalisation of PAs across these datasets, offering valuable insights by identifying real-world anomalies through PAs.
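The temporal perturbation mentioned in the abstract (mixup applied to optical flow) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `mixup_flow_patch`, the square patch shape, the uniform choice of patch locations, and the `Beta(alpha, alpha)` mixing weight are all assumptions made here for illustration, following the standard mixup formulation.

```python
import numpy as np


def mixup_flow_patch(flow, patch_size, alpha=0.4, rng=None):
    """Return a copy of `flow` with one patch replaced by a mixup of two patches.

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) displacements.
    patch_size: side length of the square patches (hypothetical parameter).
    alpha: Beta(alpha, alpha) concentration for the mixing weight (assumed).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = flow.shape
    if patch_size > h or patch_size > w:
        raise ValueError("patch_size must fit inside the flow field")

    # Pick two patch locations uniformly at random: a target and a donor.
    ys = rng.integers(0, h - patch_size + 1, size=2)
    xs = rng.integers(0, w - patch_size + 1, size=2)
    target = flow[ys[0]:ys[0] + patch_size, xs[0]:xs[0] + patch_size]
    donor = flow[ys[1]:ys[1] + patch_size, xs[1]:xs[1] + patch_size]

    # Standard mixup: convex combination with a Beta-distributed weight.
    lam = rng.beta(alpha, alpha)
    mixed = lam * target + (1.0 - lam) * donor

    # Write the mixed patch into a copy so the input flow stays untouched.
    out = flow.copy()
    out[ys[0]:ys[0] + patch_size, xs[0]:xs[0] + patch_size] = mixed
    return out, lam
```

Because the mixed patch is a convex combination of two existing patches, every perturbed value stays within the range of the original flow field; only the local motion pattern is distorted.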

Paper Structure (21 sections, 10 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: The overall architecture of our approach, which includes spatio-temporal PA generators. Spatial PA generator (eq. \ref{spatial_pseudo_anomaly_eq}): $\mathcal{F}_{s}(\text{stack}(\mathbf{x}, \mathbf{x} \odot \mathbf{m}, \mathbf{m}); \theta)$; temporal PA generator (eq. \ref{temporal_pseudo_anomaly_eq}): $\mathcal{F}_{t}(\phi(\mathbf{x}_{t}, \mathbf{x}_{t+1}))$. The spatial and temporal PAs are sampled with probability $p_s$ and $p_t$ respectively. Our VAD framework unifies the estimation of reconstruction quality (eq. \ref{spatial_loss_objective}), temporal irregularity (eq. \ref{temporal_loss_objective}) and semantic inconsistency.
  • Figure 2: Visualisation of spatial and temporal PAs, using segmentation masks. This approach also works with random masks.
  • Figure 3: Qualitative assessment: visualisation of the anomaly score over time for sample videos in Avenue (left) and ShanghaiTech (right).
  • Figure 4: Visualisation of error heatmaps for sample videos, compared with the other PA generator methods in LNTRA (astrid2021learning).
  • Figure 5: Qualitative assessment: visualisation of spatial and temporal PAs for all four datasets. Only segmentation masks are shown here; however, the approach also works with random masks.
  • ...and 3 more figures
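
The input construction in the Figure 1 caption, $\text{stack}(\mathbf{x}, \mathbf{x} \odot \mathbf{m}, \mathbf{m})$, together with sampling a PA with probability $p_s$, can be sketched as follows. This is only an illustration under stated assumptions: the channel-first layout, the mask convention (which value marks the region to inpaint), and the helper names `build_inpainting_input` and `maybe_pseudo_anomaly` are hypothetical, and the call to the Latent Diffusion Model $\mathcal{F}_{s}$ itself is left out.

```python
import numpy as np


def build_inpainting_input(x, m):
    """Stack image, masked image and mask along the channel axis.

    x: image of shape (C, H, W); m: binary mask of shape (1, H, W).
    Mirrors stack(x, x * m, m) from the caption; the channel-first layout
    and the mask convention are assumptions, not taken from the paper.
    """
    if m.shape != (1,) + x.shape[1:]:
        raise ValueError("mask must have shape (1, H, W)")
    # The mask broadcasts over the C image channels in the product x * m.
    return np.concatenate([x, x * m, m], axis=0)


def maybe_pseudo_anomaly(x, make_pa, p, rng):
    """With probability p return (make_pa(x), 1); otherwise return (x, 0).

    make_pa is a placeholder for any PA generator (hypothetical callable).
    """
    if rng.random() < p:
        return make_pa(x), 1
    return x, 0
```

For an RGB frame the stacked input has 3 + 3 + 1 = 7 channels. The same Bernoulli gate can be reused for temporal PAs by passing $p_t$ and a flow-perturbing callable.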