MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection

Jakub Micorek, Horst Possegger, Dominik Narnhofer, Horst Bischof, Mateusz Kozinski

TL;DR

MULDE is a multiscale, density-based approach to video anomaly detection. It learns a neural function $f_\theta(\tilde{\mathbf{x}},\sigma)$ that approximates the negative log-density $-\log q_\sigma(\tilde{\mathbf{x}})$ of noisy normal features across a range of Gaussian noise levels $\sigma$. Training uses a modified denoising score matching objective with a regularization term that stabilizes the estimates across scales, and at test time the multiscale log-density outputs are aggregated with a Gaussian mixture model. The method is feature-agnostic and achieves state-of-the-art results on five benchmarks in both object-centric and frame-centric VAD, with inference speed limited primarily by feature extraction. Because it explicitly models the density of normal features, MULDE gives anomaly detection a statistical foundation while remaining fast, interpretable, and adaptable to diverse video representations.
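The denoising score matching idea mentioned above can be illustrated numerically. The following is a minimal sketch, not the authors' training code: it assumes 1-D Gaussian data, a single noise level, and closed-form gradients in place of a neural network, and it omits the paper's modification and regularizer. For Gaussian noise, the gradient of $-\log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})$ is $(\tilde{\mathbf{x}}-\mathbf{x})/\sigma^2$, which serves as the regression target. The names `dsm_loss`, `true_grad`, and `wrong_grad` are hypothetical.

```python
import numpy as np

def dsm_loss(grad_f, x, sigma, rng):
    """Monte Carlo estimate of E || grad f(x_tilde) - (x_tilde - x) / sigma^2 ||^2."""
    noise = rng.normal(size=x.shape)
    x_tilde = x + sigma * noise                # inject Gaussian noise into the data
    target = (x_tilde - x) / sigma ** 2        # gradient of -log N(x_tilde; x, sigma^2 I)
    return np.mean(np.sum((grad_f(x_tilde) - target) ** 2, axis=1))

data_std, sigma = 1.0, 0.5                     # illustrative values, not from the paper
x = np.random.default_rng(0).normal(scale=data_std, size=(100_000, 1))

# Noisy data follow N(0, data_std^2 + sigma^2), so the gradient of the true
# negative log-density is x_tilde / (data_std^2 + sigma^2).
true_grad = lambda xt: xt / (data_std ** 2 + sigma ** 2)
wrong_grad = lambda xt: 2.0 * xt               # deliberately mis-specified model

# Use the same noise draws for both models so the comparison is fair.
loss_true = dsm_loss(true_grad, x, sigma, np.random.default_rng(1))
loss_wrong = dsm_loss(wrong_grad, x, sigma, np.random.default_rng(1))
print(f"true model: {loss_true:.3f}   mis-specified model: {loss_wrong:.3f}")
```

For a linear gradient $c\,\tilde{x}$ the expected loss in this toy setup is $c^2(s^2+\sigma^2) - 2c + 1/\sigma^2$ with $s$ the data standard deviation, which is minimized at $c = 1/(s^2+\sigma^2)$. The gradient of the true noisy negative log-density therefore attains the lowest loss, even though the loss never evaluates that density directly.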

Abstract

We propose a novel approach to video anomaly detection: we treat feature vectors extracted from videos as realizations of a random variable with a fixed distribution and model this distribution with a neural network. This lets us estimate the likelihood of test videos and detect video anomalies by thresholding the likelihood estimates. We train our video anomaly detector using a modification of denoising score matching, a method that injects training data with noise to facilitate modeling its distribution. To eliminate hyperparameter selection, we model the distribution of noisy video features across a range of noise levels and introduce a regularizer that tends to align the models for different levels of noise. At test time, we combine anomaly indications at multiple noise scales with a Gaussian mixture model. Running our video anomaly detector induces minimal delays as inference requires merely extracting the features and forward-propagating them through a shallow neural network and a Gaussian mixture model. Our experiments on five popular video anomaly detection benchmarks demonstrate state-of-the-art performance, both in the object-centric and in the frame-centric setup.
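The test-time aggregation step can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: an analytic negative log-density of noisy standard-normal features stands in for the trained network, the features are 2-D toy data, and the mixture is reduced to its single-component special case (one Gaussian) so that the example needs only NumPy. All function names and numeric values are hypothetical.

```python
import numpy as np

def neg_log_density(x, sigma):
    """Analytic stand-in for f_theta(x, sigma): -log N(x; 0, (1 + sigma^2) I)."""
    d = x.shape[1]
    var = 1.0 + sigma ** 2
    return 0.5 * np.sum(x ** 2, axis=1) / var + 0.5 * d * np.log(2 * np.pi * var)

def multiscale_outputs(x, sigmas):
    """One row per sample, one column per noise level."""
    return np.stack([neg_log_density(x, s) for s in sigmas], axis=1)

def fit_gaussian(feats, reg=1e-6):
    """Single-component mixture: mean and (regularized) covariance."""
    mean = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + reg * np.eye(feats.shape[1])
    return mean, cov

def anomaly_score(feats, mean, cov):
    """Negative log-likelihood under the fitted Gaussian; higher is more anomalous."""
    diff = feats - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (maha + logdet + feats.shape[1] * np.log(2 * np.pi))

rng = np.random.default_rng(0)
sigmas = np.array([0.1, 0.5, 1.0])                    # illustrative noise levels
train = rng.normal(size=(2000, 2))                    # "normal" training features
normal_test = rng.normal(size=(200, 2))
anomalous_test = rng.normal(loc=5.0, size=(200, 2))   # far from the training data

mean, cov = fit_gaussian(multiscale_outputs(train, sigmas))
train_scores = anomaly_score(multiscale_outputs(train, sigmas), mean, cov)
normal_scores = anomaly_score(multiscale_outputs(normal_test, sigmas), mean, cov)
anomalous_scores = anomaly_score(multiscale_outputs(anomalous_test, sigmas), mean, cov)

# Detect anomalies by thresholding, here at the 99th percentile of training scores.
threshold = np.percentile(train_scores, 99)
print("flagged normal:", np.mean(normal_scores > threshold))
print("flagged anomalous:", np.mean(anomalous_scores > threshold))
```

With more than one component, the Gaussian fit would be replaced by a mixture fitted with EM, for example scikit-learn's `GaussianMixture` and its `score_samples` method. The small `reg` term keeps the covariance invertible, which matters here because the analytic stand-in makes the per-scale outputs perfectly correlated.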

Paper Structure

This paper contains 38 sections, 6 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: MULDE approximates the negative log-density of noisy, normal video features at multiple levels of noise $\sigma$ with a neural network $f(\cdot,\sigma)$. The log-likelihoods estimated at multiple noise levels are combined into a single anomaly score with a Gaussian mixture model (GMM). MULDE can be trained to detect video anomalies in an object-centric or frame-centric manner. In the object-centric approach, an object detector (OD) is used to detect objects which are then fed to the feature extractor (FE). In the frame-centric approach, the feature extractor is applied to short sequences of entire frames.
  • Figure 2: The log-density function is well suited for indicating anomalies, but its gradient is not. (Left) A sample from a mixture of 4 Gaussians. (Right) Learned negative log-density approximation (left column) and the norm of its gradient (right column). The negative log-density is a good anomaly indicator, taking low values for normal data and higher values for anomalous data. By contrast, the gradient norm is low not only at the modes of the distribution, but also at the density minima between the modes, making it impossible to distinguish some anomalies from normal data.
  • Figure 3: Anomaly detection with MULDE in a test video of the ShanghaiTech data set (video 13 in scene 4). Pedestrians walking in frames 30 and 300 represent normal behavior. A person jumping across the scene is annotated as anomalous. The anomaly indication produced by MULDE is aligned with the ground truth (GT) at its beginning but terminates earlier than the GT annotation. However, careful examination of the video reveals that normal behavior (walking, cyan bounding box in the top row) resumes before the end of the annotation, as indicated by MULDE. A regularized model produces a stronger anomaly indication (plotted in blue) than one without regularization (green plot).
  • Figure 4: Performance of MULDE in frame-centric VAD on the ShanghaiTech data set without the Gaussian mixture model, with our anomaly indicator computed at individual noise scales (blue plot), and with the mixture model with 1, 3, and 5 components.
  • Figure S1: The log-density of normal training features is estimated with $f_{\theta}$ across multiple $\sigma$. MULDE leverages $f_{\theta}$ as a strong anomaly indicator.
  • ...and 2 more figures
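
The point made in the caption of Figure 2 can be checked on a toy example. The sketch below is a simplified stand-in for the figure, not a reproduction of it: it assumes a 1-D mixture of two Gaussians with hand-picked parameters and uses the analytic density instead of a learned network. It compares the negative log-density with the norm of its gradient at a mode, at the density minimum between the modes, and at a point far from the data.

```python
import math

MEANS, STD, WEIGHT = (-2.0, 2.0), 0.5, 0.5   # illustrative mixture parameters

def component_pdf(x, mean):
    z = (x - mean) / STD
    return math.exp(-0.5 * z * z) / (STD * math.sqrt(2.0 * math.pi))

def neg_log_density(x):
    return -math.log(sum(WEIGHT * component_pdf(x, m) for m in MEANS))

def grad_norm(x):
    """Absolute value of d/dx of the negative log-density."""
    density = sum(WEIGHT * component_pdf(x, m) for m in MEANS)
    d_density = sum(WEIGHT * component_pdf(x, m) * (-(x - m) / STD ** 2) for m in MEANS)
    return abs(-d_density / density)

for name, x in [("mode", -2.0), ("between modes", 0.0), ("far away", 6.0)]:
    print(f"{name:>14}: -log p = {neg_log_density(x):7.3f}   |grad| = {grad_norm(x):7.3f}")
```

At the midpoint between the two modes the gradient vanishes by symmetry, just as it does at a mode, while the negative log-density there is clearly higher than at the mode. A detector based on the gradient norm alone would therefore treat this low-density point as normal.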