MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection
Jakub Micorek, Horst Possegger, Dominik Narnhofer, Horst Bischof, Mateusz Kozinski
TL;DR
MULDE takes a multiscale, density-based approach to video anomaly detection: it learns a neural function $f_\theta(\tilde{\mathbf{x}},\sigma)$ that approximates the negative log-density $-\log q_\sigma(\tilde{\mathbf{x}})$ of noise-perturbed features $\tilde{\mathbf{x}}$ across a range of Gaussian noise levels $\sigma$. Training uses a modified denoising score matching objective with a regularization term that stabilizes the estimates across noise scales, and at test time a Gaussian mixture model aggregates the multiscale log-density outputs. The method is feature-agnostic and achieves state-of-the-art results on five benchmarks in both object-centric and frame-centric VAD, with inference time limited primarily by feature extraction. Because the anomaly score is an explicit estimate of how unlikely a feature is under the distribution of normal data, the detector rests on a clear statistical foundation, remains interpretable, and adapts to diverse video representations.
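For orientation, the standard (unmodified) denoising score matching objective for such a log-density model can be written as follows. This is a sketch of the textbook form under the assumption of isotropic Gaussian noise; it is not the paper's modified objective, whose noise-level weighting and regularization term are not reproduced here. The symbol $p$ for the distribution of clean features $\mathbf{x}$ is introduced only for this illustration.

$$
q_\sigma(\tilde{\mathbf{x}}) = \int p(\mathbf{x})\,\mathcal{N}(\tilde{\mathbf{x}};\,\mathbf{x},\,\sigma^2 \mathbf{I})\,d\mathbf{x},
\qquad
\mathcal{L}_{\mathrm{DSM}}(\theta;\sigma) = \mathbb{E}_{\mathbf{x}\sim p,\;\tilde{\mathbf{x}}\sim\mathcal{N}(\mathbf{x},\sigma^2\mathbf{I})}
\left[\,\left\| \nabla_{\tilde{\mathbf{x}}} f_\theta(\tilde{\mathbf{x}},\sigma) - \frac{\tilde{\mathbf{x}}-\mathbf{x}}{\sigma^2} \right\|^2\right]
$$

The target follows from $\nabla_{\tilde{\mathbf{x}}}\log\mathcal{N}(\tilde{\mathbf{x}};\mathbf{x},\sigma^2\mathbf{I}) = -(\tilde{\mathbf{x}}-\mathbf{x})/\sigma^2$ together with the sign flip from $f_\theta \approx -\log q_\sigma$. Since this objective constrains only the gradient of $f_\theta$, it determines $f_\theta(\cdot,\sigma)$ up to an additive constant for each $\sigma$, which is plausibly one reason a term that stabilizes the estimates across scales is useful.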
Abstract
We propose a novel approach to video anomaly detection: we treat feature vectors extracted from videos as realizations of a random variable with a fixed distribution and model this distribution with a neural network. This lets us estimate the likelihood of test videos and detect video anomalies by thresholding the likelihood estimates. We train our video anomaly detector using a modification of denoising score matching, a method that injects training data with noise to facilitate modeling its distribution. To eliminate hyperparameter selection, we model the distribution of noisy video features across a range of noise levels and introduce a regularizer that tends to align the models for different levels of noise. At test time, we combine anomaly indications at multiple noise scales with a Gaussian mixture model. Running our video anomaly detector induces minimal delays as inference requires merely extracting the features and forward-propagating them through a shallow neural network and a Gaussian mixture model. Our experiments on five popular video anomaly detection benchmarks demonstrate state-of-the-art performance, both in the object-centric and in the frame-centric setup.
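The idea of injecting noise to make the feature distribution easier to model can be checked on a toy problem. The sketch below is not the authors' implementation. It assumes one-dimensional Gaussian "features" and replaces the neural network with a hypothetical one-parameter model whose gradient is $a\,\tilde{x}$, fitted by least squares to the standard denoising score matching target $(\tilde{x}-x)/\sigma^2$. The function name `fit_linear_energy_gradient` and all constants are illustrative assumptions. The regularizer and the Gaussian mixture model are omitted.

```python
import random


def fit_linear_energy_gradient(clean, sigma, rng):
    """Least-squares DSM fit of grad f(x_noisy) = a * x_noisy at one noise level.

    Minimizes sum((a * x_noisy - target)**2) over a, where the target is the
    standard denoising score matching target for a negative log-density model.
    """
    num = 0.0
    den = 0.0
    for x in clean:
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)  # inject Gaussian noise
        target = (x_noisy - x) / sigma**2          # DSM regression target
        num += x_noisy * target
        den += x_noisy * x_noisy
    return num / den                               # closed-form minimizer


rng = random.Random(0)
data_std = 1.0  # assumed standard deviation of the clean toy features
clean = [rng.gauss(0.0, data_std) for _ in range(200_000)]

# Fit one model per noise level, mimicking a range of noise scales.
for sigma in (0.25, 0.5, 1.0):
    a_hat = fit_linear_energy_gradient(clean, sigma, rng)
    # For Gaussian data, the noisy density is N(0, data_std^2 + sigma^2),
    # so the gradient of its negative log-density is x / (data_std^2 + sigma^2).
    a_true = 1.0 / (data_std**2 + sigma**2)
    print(f"sigma={sigma}: fitted slope={a_hat:.3f}, analytic slope={a_true:.3f}")
```

In this toy setting the fitted slope should approach the analytic slope of the noisy density at every noise level, even though the regression target only uses pairs of clean and noisy samples and never evaluates the density itself.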
