Transformer Based Self-Context Aware Prediction for Few-Shot Anomaly Detection in Videos
Gargi V. Pillai, Ashish Verma, Debashis Sen
TL;DR
The paper tackles video anomaly detection by learning the non-anomalous dynamics of a single video in a one-class, few-shot setting. It introduces a transformer-based model that predicts the feature of the next frame from a sequence of preceding frames, using a self-context that attends over the input sequence via a shared encoder–decoder. Each frame feature fuses spatial features from ResNet152 with temporal features from FlowNet2, and the model is trained with the loss $L_{MSE}$ on non-anomalous frames only. A frame is flagged as anomalous when the predicted feature $\hat{F}_{T+1}$ deviates from the actual feature $F_{T+1}$, with temporal consistency applied. Experiments on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets show state-of-the-art performance on Ped2 and Avenue and competitive results on ShanghaiTech, and ablations confirm the benefit of the self-context. Because the model is trained only on the first few non-anomalous frames of the video at hand, it needs little training data and adapts to video-specific normality, which suits surveillance scenarios.
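As a rough sketch of the objective and the test-time comparison named above, assume $d$-dimensional fused features and a plain per-dimension average. The normalisation, the score symbol $s_{T+1}$, and the threshold $\tau$ are illustrative assumptions and are not specified in this summary:

$$L_{MSE} = \frac{1}{d}\,\bigl\lVert \hat{F}_{T+1} - F_{T+1} \bigr\rVert_2^2, \qquad s_{T+1} = \frac{1}{d}\,\bigl\lVert \hat{F}_{T+1} - F_{T+1} \bigr\rVert_2^2, \qquad \text{frame } T+1 \text{ is flagged if } s_{T+1} > \tau.$$

Under this reading the same quantity serves as the training loss on non-anomalous frames and as the anomaly score at inference.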
Abstract
Anomaly detection in videos is a challenging task, as anomalies in different videos are of different kinds. Therefore, a promising way to approach video anomaly detection is by learning the non-anomalous nature of the video at hand. To this end, we propose a one-class, few-shot-learning-driven, transformer-based approach for anomaly detection in videos that is self-context aware. Features from the first few consecutive non-anomalous frames in a video are used to train the transformer to predict the non-anomalous feature of the subsequent frame. This takes place under the attention of a self-context learned from the input features themselves. After learning, given a few previous frames, the video-specific transformer is used to infer whether a frame is anomalous by comparing the feature it predicts with the actual feature. The effectiveness of the proposed method with respect to the state of the art is demonstrated through qualitative and quantitative results on different standard datasets. We also study the positive effect of the self-context used in our approach.
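The inference step described above (compare the predicted feature with the actual one, then decide) might look roughly like the following minimal Python sketch. It is not the authors' implementation: the function names, the mean-plus-`k`-standard-deviations threshold, and the majority-vote window used as a stand-in for temporal consistency are all assumptions made for illustration, and the transformer that produces the predicted features is left out entirely.

```python
from statistics import mean, pstdev


def mse(predicted, actual):
    """Mean squared error between two equal-length feature vectors."""
    if len(predicted) != len(actual):
        raise ValueError("feature vectors must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def fit_threshold(normal_errors, k=3.0):
    """Assumed rule: mean + k * std of errors on the non-anomalous frames."""
    return mean(normal_errors) + k * pstdev(normal_errors)


def flag_frames(errors, threshold, window=3):
    """Flag a frame when most frames in a centred window exceed the threshold.

    The majority vote is an assumed stand-in for temporal consistency;
    it suppresses isolated single-frame spikes.
    """
    raw = [e > threshold for e in errors]
    half = window // 2
    flags = []
    for i in range(len(raw)):
        neighbourhood = raw[max(0, i - half): i + half + 1]
        flags.append(sum(neighbourhood) * 2 > len(neighbourhood))
    return flags
```

In use, `mse` would be evaluated on each pair of predicted and actual features, `fit_threshold` on the errors from the first few non-anomalous frames, and `flag_frames` on the per-frame errors of the rest of the video. With a window of 3, a single high-error frame surrounded by low-error frames is not flagged, while a sustained run of high errors is.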
