Recognition of Abnormal Events in Surveillance Videos using Weakly Supervised Dual-Encoder Models
Noam Tsfaty, Avishai Weizman, Liav Cohen, Moshe Tshuva, Yehudit Aperstein
TL;DR
The paper tackles surveillance video anomaly detection with weak supervision using only video-level labels. It introduces a dual-encoder MIL framework that fuses spatiotemporal I3D features with TimeSformer transformer representations, processing 32 uniform 16-frame segments per video to produce per-segment scores that are aggregated by top-k pooling. Video-level predictions are trained with binary cross-entropy, achieving an AUC of 90.7% on the UCF-Crime dataset and outperforming a range of baselines. The results demonstrate that combining complementary encoders and weak supervision can yield robust anomaly detection suitable for real-world surveillance applications.
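The scoring and pooling step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the fusion rule (plain concatenation), the linear scorer, the helper names (`fuse_features`, `segment_score`, `topk_pool`, `bce_loss`), and the value of `k` are all hypothetical, since the summary does not specify them. Only the 32 segments per video, the top-k aggregation, and the binary cross-entropy objective come from the text.

```python
import math

def fuse_features(i3d_feat, timesformer_feat):
    """Concatenate the two per-segment feature vectors (assumed fusion rule)."""
    return list(i3d_feat) + list(timesformer_feat)

def segment_score(feature, weights, bias=0.0):
    """Hypothetical linear scorer followed by a sigmoid; returns a value in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def topk_pool(scores, k):
    """Average the k highest segment scores to obtain one video-level score."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of segments")
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / k

def bce_loss(p, label, eps=1e-7):
    """Binary cross-entropy between a video-level score and a 0/1 video label."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Toy example: 32 segment scores, three of which look anomalous.
scores = [0.1] * 29 + [0.9, 0.8, 0.7]
video_score = topk_pool(scores, k=3)   # k=3 is an arbitrary illustrative choice
loss = bce_loss(video_score, label=1)  # the only supervision is the video label
print(video_score, loss)
```

Because only the top-k segments contribute to the video-level score, the loss for an anomalous video pushes up the scores of its most suspicious segments, which is how video-level labels can yield per-segment predictions.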
Abstract
We address the challenge of detecting rare and diverse anomalies in surveillance videos using only video-level supervision. Our dual-backbone framework combines convolutional and transformer representations through top-k pooling, achieving 90.7% area under the curve (AUC) on the UCF-Crime dataset.
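For readers unfamiliar with the reported metric, the sketch below computes the area under the ROC curve as the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as half. The function name `roc_auc` and the toy data are illustrative assumptions; the summary does not say whether the 90.7% figure is computed per frame, per segment, or per video, so the sketch is agnostic about what a "sample" is.

```python
def roc_auc(labels, scores):
    """Pairwise-ranking estimate of ROC AUC for 0/1 labels and real-valued scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomaly correctly ranked above normal
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data: two anomalous and two normal samples.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))  # 3 of 4 positive/negative pairs are ordered correctly
```

The quadratic pairwise loop is chosen for clarity; a sort-based rank computation gives the same value more efficiently on large evaluation sets.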
