Cross-Domain Learning for Video Anomaly Detection with Limited Supervision

Yashika Jain; Ali Dabouei; Min Xu

TL;DR

This work tackles cross-domain video anomaly detection (VAD) under limited supervision by introducing Cross-Domain Learning (CDL), a weakly-supervised framework that leverages external unlabeled data to improve generalization. CDL jointly trains two predictors with distinct backbones (CLIP and I3D), estimates the prediction bias on external data, and adaptively reweights learning using segment-level uncertainty, quantified via the cosine similarity of latent representations. With iterative pseudo-label refinement and uncertainty-driven training, CDL achieves state-of-the-art cross-domain performance on UCF-Crime and XD-Violence, along with robust open-set results. The work also highlights the importance of accurate test annotations, shows improved anomaly localization in unseen domains with limited labeling, and offers insight into uncertainty as a proxy for pseudo-label quality in self-training for VAD.

Abstract

Video Anomaly Detection (VAD) automates the identification of unusual events, such as security threats in surveillance videos. In real-world applications, VAD models must operate effectively in cross-domain settings, identifying rare anomalies and scenarios that are not well represented in the training data. However, existing cross-domain VAD methods focus on unsupervised learning, resulting in performance that falls short of real-world expectations. Since acquiring weak supervision, i.e., video-level labels, for the source domain is cost-effective, we conjecture that combining it with external unlabeled data has notable potential to enhance cross-domain performance. To this end, we introduce a novel weakly-supervised framework for Cross-Domain Learning (CDL) in VAD that incorporates external data during training by estimating its prediction bias and adaptively minimizing it using the predicted uncertainty. We demonstrate the effectiveness of the proposed CDL framework through comprehensive experiments conducted in various configurations on two large-scale VAD datasets: UCF-Crime and XD-Violence. Our method significantly surpasses state-of-the-art methods in cross-domain evaluations, achieving an average absolute improvement of 19.6% on UCF-Crime and 12.87% on XD-Violence.
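The reweighting idea described above can be illustrated with a minimal sketch. It assumes that a segment's uncertainty regularization score is the cosine similarity between the latent vectors of the two predictors, clamped to [0, 1], and that this score multiplies the per-segment binary cross-entropy against the pseudo-label. The function names, the clamping, and the averaging are illustrative assumptions, not the paper's exact formulation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one predicted score in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def weighted_external_loss(preds, pseudo_labels, latents_a, latents_b):
    """Average BCE on external segments, down-weighted where the two
    predictors' latent vectors disagree.

    Assumption (hypothetical): weight = max(0, cosine similarity), so
    agreement gives a weight near 1 and disagreement a weight near 0.
    """
    total = 0.0
    for p, y, za, zb in zip(preds, pseudo_labels, latents_a, latents_b):
        weight = max(0.0, cosine_similarity(za, zb))
        total += weight * bce(p, y)
    return total / len(preds)
```

Under these assumptions, a segment whose two latent vectors are orthogonal contributes nothing to the loss, while a segment with aligned latents contributes its full BCE term.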

Paper Structure (25 sections, 13 equations, 5 figures, 6 tables)

Introduction
Related Works
Method
Problem Definition
Feature Extraction and Temporal Processing
Bias Estimation for External Data
Uncertainty Estimation
Training Process
Inference - Extending Segment-level Scores to Frame-level Scores
Experiments
Implementation Details
Noise in the Test Annotations of Benchmark Datasets
Comparison with Prior Works
Cross-Domain Scenarios
Open-Set Scenarios
...and 10 more sections

Figures (5)

Figure 1: Anomaly score comparison on a video from the XD-Violence dataset, with and without the proposed CDL framework. The model trained without CDL, using UCF-Crime as the weakly labeled set, consistently yields high anomaly scores. In contrast, the model trained with CDL, using UCF-Crime as the weakly labeled set and HACS as the unlabeled set, localizes the anomalous frames more accurately.
Figure 2: Overview of the proposed CDL framework. CDL Step 0: The ranking loss, $\mathcal{L}_{\text{rank}}$ (Supp. Mat. § 6), is employed to train two pseudo-label generation models, $P_m$ and $P_a$ (see Feature Extraction and Temporal Processing), on weakly labeled data, $\mathcal{D}_l$. CDL Step $k$, $k>0$: $P_m$ and $P_a$ are trained iteratively on $\mathcal{D}_l \cup \mathcal{D}_u$, incorporating pseudo-labels for $\mathcal{D}_u$ generated at the end of the previous CDL step. To deal with noise in the pseudo-labels, uncertainty regularization scores are estimated using the divergence between the predictions of the two models (see Uncertainty Estimation). When optimizing on $\mathcal{D}_u$, the prediction bias, $\mathcal{L}_{\text{bce}}$ (see Bias Estimation for External Data), for external data is reweighted using the computed uncertainty regularization scores (see Training Process).
Figure 3: (a) Correlation between uncertainty scores and BCE loss computed between the estimated scores and ground truth. When $\lambda_3 = 10^{-3}$, as expected, a consistently high negative correlation emerges, demonstrating the effectiveness of the proposed uncertainty quantification method as a reliable proxy for pseudo-label quality. (b) Cumulative Distribution Function (CDF) plots illustrating the progression of average uncertainty regularization scores for each video during training. CDL step 20 has a higher concentration of scores around 1 than CDL step 2, which in turn has a higher concentration around 1 than CDL step 1. This suggests that, as training progresses, scores tend toward higher values, indicating more confident pseudo-label predictions. (c) Ablation study on the coefficient of the cosine similarity loss term, $\lambda_3$. (d) Ablation study on the number of segments, $n_{s}$.
Figure 4: Ablation study on the impact of the size of external data.
Figure 5: A comparison between the original annotations (UCF) and the proposed annotations (UCF-R). The green region represents frames labeled as anomalous by both the original and proposed annotations. The red region indicates frames labeled as anomalous by the proposed annotations but not by the original annotations. The unshaded (white) region denotes normal frames. For instance, in the first row, while the original annotations just label frames depicting arson (a person setting the Christmas tree on fire) as anomalous, UCF-R also labels the frames depicting the fire and smoke following arson as anomalous.
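
The analysis behind Figure 3(a) can be sketched as follows. The caption does not name the correlation measure, so this sketch assumes Pearson correlation, computed between per-segment uncertainty scores and the per-segment BCE of the estimated scores against ground truth. The function names are hypothetical.

```python
import math

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one predicted score in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def uncertainty_loss_correlation(uncertainty_scores, est_scores, ground_truth):
    """Correlate uncertainty scores with the BCE of estimated scores.

    Assumption: a strongly negative value means high uncertainty scores
    coincide with low loss, i.e. with better pseudo-labels.
    """
    losses = [bce(p, y) for p, y in zip(est_scores, ground_truth)]
    return pearson(uncertainty_scores, losses)
```

A value close to -1 from `uncertainty_loss_correlation` would correspond to the "consistently high negative correlation" described in the caption, under the assumptions stated above.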
