Cross-Domain Learning for Video Anomaly Detection with Limited Supervision
Yashika Jain, Ali Dabouei, Min Xu
TL;DR
This work tackles cross-domain video anomaly detection under limited supervision by introducing Cross-Domain Learning (CDL), a weakly-supervised framework that leverages external unlabeled data to improve generalization. It jointly trains two predictors with distinct backbones (CLIP and I3D), estimates prediction bias on external data, and adaptively reweights learning using segment-level uncertainty quantified via cosine similarity of latent representations. Through iterative pseudo-label refinement and uncertainty-driven training, CDL achieves state-of-the-art cross-domain performance on UCF-Crime and XD-Violence as well as robust open-set results, and it highlights the importance of accurate test annotations. In practice, the approach enables better anomaly localization in unseen domains with limited labeling, and it provides insight into uncertainty as a proxy for pseudo-label quality in self-training for VAD.
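The uncertainty-driven reweighting described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mapping from cosine similarity to an uncertainty in [0, 1], the weight `1 - uncertainty`, and the use of a weighted binary cross-entropy are all assumptions made for illustration. The sketch also assumes that the latent representations of the two predictors have already been projected to a shared dimension.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (num_segments, dim) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den


def segment_uncertainty(z_a, z_b):
    """Hypothetical uncertainty: disagreement between the two predictors' latents.

    Assumption: similarity +1 maps to uncertainty 0, similarity -1 to 1.
    """
    return (1.0 - cosine_similarity(z_a, z_b)) / 2.0


def weighted_bce(pred, pseudo_label, uncertainty, eps=1e-8):
    """Binary cross-entropy on pseudo-labels, down-weighting uncertain segments.

    Assumption: the per-segment weight is simply 1 - uncertainty.
    """
    weights = 1.0 - uncertainty
    pred = np.clip(pred, eps, 1.0 - eps)
    bce = -(pseudo_label * np.log(pred) + (1.0 - pseudo_label) * np.log(1.0 - pred))
    return float(np.sum(weights * bce) / (np.sum(weights) + eps))


# Toy latents for three segments from two predictors (made-up values).
z_a = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
z_b = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

u = segment_uncertainty(z_a, z_b)  # agreeing, orthogonal, opposing latents
loss = weighted_bce(
    pred=np.array([0.9, 0.5, 0.1]),
    pseudo_label=np.array([1.0, 1.0, 1.0]),
    uncertainty=u,
)
print(u, loss)
```

In this toy setup, the third segment, where the two latents point in opposite directions, receives a weight of zero, so its pseudo-label does not influence the loss at all. A softer weighting (for example, an exponential decay in the uncertainty) would be an equally plausible design choice.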
Abstract
Video Anomaly Detection (VAD) automates the identification of unusual events, such as security threats in surveillance videos. In real-world applications, VAD models must operate effectively in cross-domain settings, identifying rare anomalies and scenarios that are not well represented in the training data. However, existing cross-domain VAD methods focus on unsupervised learning, resulting in performance that falls short of real-world expectations. Since acquiring weak supervision, i.e., video-level labels, for the source domain is cost-effective, we conjecture that combining it with external unlabeled data has notable potential to enhance cross-domain performance. To this end, we introduce a novel weakly-supervised framework for Cross-Domain Learning (CDL) in VAD that incorporates external data during training by estimating its prediction bias and adaptively minimizing that bias using the predicted uncertainty. We demonstrate the effectiveness of the proposed CDL framework through comprehensive experiments conducted in various configurations on two large-scale VAD datasets: UCF-Crime and XD-Violence. Our method significantly surpasses state-of-the-art methods in cross-domain evaluations, achieving an average absolute improvement of 19.6% on UCF-Crime and 12.87% on XD-Violence.
