Bounding Boxes and Probabilistic Graphical Models: Video Anomaly Detection Simplified

Mia Siemon, Thomas B. Moeslund, Barry Norton, Kamal Nasrollahi

TL;DR

This study hypothesizes that representing objects by their bounding boxes alone can be sufficient to identify anomalous events in a scene, and designs a model based on human reasoning whose output can be explained in human-understandable terms.

Abstract

In this study, we formulate the task of Video Anomaly Detection (VAD) as a probabilistic analysis of object bounding boxes. We hypothesize that representing objects by their bounding boxes alone can be sufficient to identify anomalous events in a scene. The implied value of this approach is increased object anonymization, faster model training, and lower computational cost. This can particularly benefit video surveillance applications running on edge devices such as cameras. We design our model based on human reasoning, which lends itself to explaining model output in human-understandable terms. Moreover, the slowest model trains in under 7 seconds on an 11th Generation Intel Core i9 processor. While our approach drastically reduces the problem's feature space compared with prior art, we show that this does not reduce performance: the results we report are highly competitive on the benchmark datasets CUHK Avenue and ShanghaiTech, and significantly exceed the latest state-of-the-art results on StreetScene, which has so far proven to be the most challenging VAD dataset.
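To make the idea of a "probabilistic analysis of object bounding boxes" concrete, here is a minimal toy sketch. It is not the Bayesian network proposed in the paper: the grid size, the `(label, box)` input format, and the function names `cell_of`, `fit`, and `anomaly_score` are all assumptions made for illustration. It estimates how likely an object class is in each region of the frame from anomaly-free training boxes, then scores a new box by its negative log-probability.

```python
from collections import Counter
import math

GRID = 4  # hypothetical choice: split the frame into GRID x GRID cells


def cell_of(box, frame_w, frame_h):
    """Map a box (x, y, w, h) to the grid cell containing its centre."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    col = min(int(cx / frame_w * GRID), GRID - 1)
    row = min(int(cy / frame_h * GRID), GRID - 1)
    return row, col


def fit(train, frame_w, frame_h):
    """Count (cell, label) occurrences in anomaly-free training boxes."""
    counts = Counter()
    for label, box in train:
        counts[(cell_of(box, frame_w, frame_h), label)] += 1
    return counts


def anomaly_score(counts, labels, label, box, frame_w, frame_h, alpha=1.0):
    """Return -log P(label | cell), using Laplace smoothing with weight alpha."""
    cell = cell_of(box, frame_w, frame_h)
    total = sum(counts[(cell, l)] for l in labels)
    p = (counts[(cell, label)] + alpha) / (total + alpha * len(labels))
    return -math.log(p)


# Toy data: pedestrians are only ever seen in the top-left cell.
labels = ["person", "car"]
train = [("person", (10, 10, 20, 40))] * 50
counts = fit(train, 400, 400)

person_score = anomaly_score(counts, labels, "person", (12, 8, 20, 40), 400, 400)
car_score = anomaly_score(counts, labels, "car", (12, 8, 60, 40), 400, 400)
```

In this toy setup a car appearing where only pedestrians were observed receives a higher score than a pedestrian in the same place. The per-cell counts also hint at why such a model can be explained in plain terms: the score can be traced back to how often a given object class was seen in a given region.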

Paper Structure (38 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Proposed Video Anomaly Detection Pipeline
  • Figure 2: Left: All spatial RVs from \ref{tab:random-variables} except for the frame F are illustrated on a sample image from StreetScene, which was converted to greyscale for better visualization. Right: Our proposed BN model with conditional relations between all RVs to perform VAD.
  • Figure 3: A proposed visualization for explaining the anomaly score extracted for a set of cells, given the current appearance/velocity of the bounding box expressed through the RVs defined in \ref{tab:random-variables}.
  • Figure 4: Contrasting the spatial and spatio-temporal model versions (\ref{fig:network-structure--spatio-temporal}) in terms of frame-level AUC scores, based on three concatenated CUHK Avenue test videos containing only temporal anomalies (a man running, a child jumping). The images are generated by the spatio-temporal model version.
  • Figure 5: Contrasting different observation generation techniques. Reported results (%) are based on test video #03 of CUHK Avenue. Ground Truth annotations are drawn in red, and detections in green (0 = anomalous, 1 = normal).
  • ...and 4 more figures