Distilling Aggregated Knowledge for Weakly-Supervised Video Anomaly Detection
Jash Dalvi, Ali Dabouei, Gunjan Dhanuka, Min Xu
TL;DR
The paper tackles weakly-supervised video anomaly detection under limited labeled data by introducing DAKD, a framework that first builds a Teacher model aggregating multiple backbones via a Temporal Aggregation Module (TAM) with disentangled cross-attention. This aggregated knowledge is then distilled into a lightweight Student through a bi-level pipeline that combines prediction-level distillation (using MIL-derived soft labels) and feature-level distillation (an InfoNCE loss with a temperature parameter), yielding a practical, single-backbone model. Results on UCF-Crime, ShanghaiTech, and XD-Violence show state-of-the-art performance and robust anomaly localization, and ablations confirm the contribution of the TAM, the distillation losses, and backbone diversity. Because only the single-backbone Student is needed at inference, the approach balances accuracy with computational efficiency, which matters for real-world surveillance.
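To make the bi-level distillation idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the prediction-level term is a binary cross-entropy between Student snippet scores and Teacher soft labels, that the feature-level InfoNCE term treats the Teacher feature at the same index as the positive and the other features in the batch as negatives, and that the two terms are summed with a weight. The function names and the parameters `tau` and `lam` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(student_feats, teacher_feats, tau=0.1):
    """Feature-level term: InfoNCE with temperature tau.

    Assumption: the positive for student_feats[i] is teacher_feats[i];
    every other Teacher feature in the batch serves as a negative.
    """
    s = [l2_normalize(v) for v in student_feats]
    t = [l2_normalize(v) for v in teacher_feats]
    total = 0.0
    for i in range(len(s)):
        # Cosine similarities to every Teacher feature, scaled by 1 / tau.
        logits = [sum(a * b for a, b in zip(s[i], tj)) / tau for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(s)


def soft_label_bce(student_scores, teacher_scores, eps=1e-7):
    """Prediction-level term: cross-entropy against Teacher soft labels in [0, 1]."""
    total = 0.0
    for p, q in zip(student_scores, teacher_scores):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return total / len(student_scores)


def bilevel_distillation_loss(student_scores, teacher_scores,
                              student_feats, teacher_feats,
                              tau=0.1, lam=1.0):
    """Weighted sum of the two terms; the weight lam is a hypothetical knob."""
    return (soft_label_bce(student_scores, teacher_scores)
            + lam * info_nce(student_feats, teacher_feats, tau))
```

In this sketch a smaller `tau` sharpens the softmax over similarities, so the Student is penalized more heavily whenever any negative is nearly as similar as its positive.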
Abstract
Video anomaly detection aims to develop automated models capable of identifying abnormal events in surveillance videos. The benchmark setup for this task is extremely challenging due to: i) the limited size of the training sets, ii) weak supervision provided in terms of video-level labels, and iii) intrinsic class imbalance induced by the scarcity of abnormal events. In this work, we show that distilling knowledge from aggregated representations of multiple backbones into a single-backbone Student model achieves state-of-the-art performance. In particular, we develop a bi-level distillation approach along with a novel disentangled cross-attention-based feature aggregation network. Our proposed approach, DAKD (Distilling Aggregated Knowledge with Disentangled Attention), demonstrates superior performance compared to existing methods across multiple benchmark datasets. Notably, we achieve significant improvements of 1.36%, 0.78%, and 7.02% on the UCF-Crime, ShanghaiTech, and XD-Violence datasets, respectively.
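The weak-supervision setting in point ii) can be illustrated with a small sketch of multiple-instance learning (MIL), where only a video-level label is available but the model emits one score per snippet. The example below uses top-k mean pooling, a common MIL formulation in this literature; it is an assumption made for illustration and not necessarily the pooling used in DAKD. The function names and the parameter `k` are hypothetical.

```python
import math


def video_score(snippet_scores, k=3):
    """Pool snippet-level anomaly scores into one video-level score
    by averaging the k highest scores (top-k mean pooling)."""
    k = max(1, min(k, len(snippet_scores)))
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / k


def mil_bce(snippet_scores, video_label, k=3, eps=1e-7):
    """Binary cross-entropy between the pooled score and the video-level
    label (1 = the video contains an anomaly, 0 = normal video)."""
    p = min(max(video_score(snippet_scores, k), eps), 1.0 - eps)
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1.0 - p))


# A video whose label says "anomalous" only needs a few high-scoring
# snippets to reach a low loss; the remaining snippets stay unconstrained.
scores = [0.1, 0.9, 0.8, 0.2]
loss_if_anomalous = mil_bce(scores, video_label=1, k=2)
loss_if_normal = mil_bce(scores, video_label=0, k=2)
```

Because the loss only touches the top-scoring snippets, the per-snippet scores it produces can serve as the kind of soft labels a Student is later trained to match.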
