Recognition of Abnormal Events in Surveillance Videos using Weakly Supervised Dual-Encoder Models

Noam Tsfaty, Avishai Weizman, Liav Cohen, Moshe Tshuva, Yehudit Aperstein

TL;DR

The paper tackles anomaly detection in surveillance video under weak supervision, using only video-level labels. It introduces a dual-encoder multiple-instance learning (MIL) framework that fuses spatiotemporal I3D features with TimeSformer transformer representations. Each video is split into 32 segments, each represented by 16 uniformly sampled frames, and the model produces per-segment scores that are aggregated by top-k pooling into a video-level prediction trained with binary cross-entropy. The method achieves an AUC of 90.7% on the UCF-Crime dataset and outperforms a range of baselines. The results demonstrate that combining complementary encoders with weak supervision can yield robust anomaly detection suitable for real-world surveillance applications.
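The aggregation and training signal described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `topk_pool` and `binary_cross_entropy`, the choice to average the top-k scores, and the value `k = 3` are all assumptions made for illustration, since the summary does not state them.

```python
import numpy as np


def topk_pool(segment_scores: np.ndarray, k: int) -> float:
    """Average the k highest per-segment scores into one video-level score.

    Averaging is an assumption; the summary only says "top-k pooling".
    """
    if not 1 <= k <= segment_scores.shape[0]:
        raise ValueError("k must be between 1 and the number of segments")
    # np.partition moves the k largest values into the last k positions.
    top = np.partition(segment_scores, -k)[-k:]
    return float(top.mean())


def binary_cross_entropy(p: float, y: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy between a predicted probability p and a label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return float(-(y * np.log(p) + (1 - y) * np.log(1.0 - p)))


# Hypothetical per-segment scores for one video with 32 segments:
# mostly low, with three segments that look anomalous.
scores = np.full(32, 0.1)
scores[[4, 10, 20]] = 0.9

video_score = topk_pool(scores, k=3)          # k = 3 is an illustrative choice
loss = binary_cross_entropy(video_score, y=1)  # y = 1: video labeled anomalous
```

Because only the highest-scoring segments contribute to the video-level score, a video-level label can supervise the segments most likely to contain the anomaly without any per-segment annotation.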

Abstract

We address the challenge of detecting rare and diverse anomalies in surveillance videos using only video-level supervision. Our dual-backbone framework combines convolutional and transformer representations through top-k pooling, achieving 90.7% area under the curve (AUC) on the UCF-Crime dataset.

Paper Structure

This paper contains 4 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: Illustration of the dual-backbone MIL framework. Each video ($v_m$) is divided into 32 temporal segments ($u_{m,i}$). From each segment, 16 frames are uniformly sampled to form a shorter segment ($x_{m,i}$), which is encoded by the I3D (convolution-based) and TimeSformer (transformer-based) encoders. The concatenated and $\ell_2$-normalized features are processed by a compact prediction head and aggregated through top-$k$ pooling to produce the final video-level anomaly prediction.
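
The segmentation and feature-fusion steps in the caption can be sketched as follows. This is a hedged illustration rather than the paper's code: the helper names `sample_frame_indices` and `fuse_features` are hypothetical, the encoders are replaced by random placeholder features with made-up dimensions, and the sketch assumes concatenation happens before $\ell_2$ normalization (the caption does not state the order).

```python
import numpy as np


def sample_frame_indices(num_frames: int,
                         num_segments: int = 32,
                         frames_per_segment: int = 16) -> np.ndarray:
    """Return frame indices of shape (num_segments, frames_per_segment)."""
    # num_segments + 1 evenly spaced edges split the video into equal segments.
    edges = np.linspace(0, num_frames, num_segments + 1)
    indices = np.empty((num_segments, frames_per_segment), dtype=np.int64)
    for i in range(num_segments):
        start, end = edges[i], edges[i + 1]
        # Uniformly spaced positions inside [start, end), floored to indices.
        pos = start + (end - start) * np.arange(frames_per_segment) / frames_per_segment
        indices[i] = np.clip(np.floor(pos).astype(np.int64), 0, num_frames - 1)
    return indices


def fuse_features(feat_a: np.ndarray, feat_b: np.ndarray,
                  eps: float = 1e-12) -> np.ndarray:
    """Concatenate two per-segment feature matrices, then l2-normalize rows.

    The concatenate-then-normalize order is an assumption of this sketch.
    """
    fused = np.concatenate([feat_a, feat_b], axis=1)
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    return fused / np.maximum(norms, eps)


# A hypothetical 1024-frame video: 32 segments of 32 frames, 16 sampled each.
idx = sample_frame_indices(1024)

# Placeholder features standing in for the two encoders' outputs;
# the dimensions 4 and 6 are arbitrary, not the real encoder widths.
rng = np.random.default_rng(0)
fused = fuse_features(rng.normal(size=(32, 4)), rng.normal(size=(32, 6)))
```

In a real pipeline the frames selected by `idx` would be passed through each encoder, and the fused per-segment vectors would feed the prediction head that produces the scores pooled by top-$k$.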