Abnormal Event Detection In Videos Using Deep Embedding
Darshan Venkatrayappa
TL;DR
The paper tackles video anomaly detection without labeled anomalies by proposing a hybrid framework that fuses depth, motion, and appearance features through a Central-Net inspired fusion block and then applies a one-class hypersphere objective that maps normal data toward a hypercenter $c$ by minimizing $\|\phi(x)-c\|^2$. It first pretrains a convolutional autoencoder on the fused features and then finetunes the encoder to pull embeddings toward $c$, so that anomalies can be detected by their distance to the center. Evaluations on UCSD Ped2, CUHK Avenue, and ShanghaiTech show results competitive with other unsupervised methods, supporting the benefits of multi-modal fusion and the hypercenter approach. The work highlights the practical potential of unsupervised, multi-modal embedding learning for scalable surveillance analytics; future directions include additional modalities such as pose and audio, and joint training of the fusion components.
Abstract
Abnormal event detection, or anomaly detection, in surveillance videos remains a challenge because of the diversity of possible events. Because anomalous events are not available at training time, anomaly detection requires learning methods that work without supervision. In this work we propose an unsupervised approach to video anomaly detection that uses a hybrid architecture to jointly optimize the objectives of the deep neural network and the anomaly detection task. In the first step, a convolutional autoencoder is pre-trained in an unsupervised manner on a fusion of depth, motion and appearance features. In the second step, we take the encoder part of the pre-trained autoencoder and extract the embeddings of the fused input. We then fine-tune the encoder to map these embeddings to a hypercenter. As a result, embeddings of normal data fall near the hypercenter, whereas embeddings of anomalous data fall far away from it.
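
To make the distance-based scoring concrete, the following minimal sketch computes a hypercenter, the one-class objective, and per-sample anomaly scores on toy embeddings. It is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the embeddings are hand-written 2-D vectors rather than encoder outputs, and initialising $c$ as the mean of the normal embeddings is an assumption borrowed from common one-class practice.

```python
from typing import List, Sequence

Vector = Sequence[float]


def hypercenter(embeddings: Sequence[Vector]) -> List[float]:
    """Return the per-dimension mean of the embeddings.

    Assumption: c is initialised as the mean of normal embeddings;
    the source text does not specify how c is chosen.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def anomaly_score(embedding: Vector, c: Vector) -> float:
    """Squared Euclidean distance ||phi(x) - c||^2 to the hypercenter."""
    return sum((a - b) ** 2 for a, b in zip(embedding, c))


def one_class_loss(embeddings: Sequence[Vector], c: Vector) -> float:
    """Mean squared distance to c, the quantity minimised during fine-tuning."""
    return sum(anomaly_score(e, c) for e in embeddings) / len(embeddings)


# Toy stand-ins for encoder outputs phi(x) on normal training data.
normal = [[1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0]]
c = hypercenter(normal)

# A sample near the centre scores low; a distant one scores high.
near_score = anomaly_score([2.0, 2.0], c)
far_score = anomaly_score([5.0, 6.0], c)
```

At test time a sample would be flagged as anomalous when its score exceeds a chosen threshold; how that threshold is set is not covered by this sketch.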
