A Lightweight Video Anomaly Detection Model with Weak Supervision and Adaptive Instance Selection

Yang Wang, Jiaogen Zhou, Jihong Guan

TL;DR

A lightweight video anomaly detection model is developed that achieves AUC scores comparable or even superior to those of state-of-the-art methods with significantly fewer model parameters.

Abstract

Video anomaly detection aims to determine whether a given video contains any abnormal events, behaviors, or objects, which enables effective and intelligent public safety management. As video anomaly labeling is both time-consuming and expensive, most existing works employ unsupervised or weakly supervised learning methods. This paper focuses on weakly supervised video anomaly detection, in which each training video is labeled only as containing or not containing anomalies, with no information about which frames the anomalies occur in. However, the uncertainty of weakly labeled data and the large model size prevent existing methods from being widely deployed in real scenarios, especially in resource-limited settings such as edge computing. In this paper, we develop a lightweight video anomaly detection model. On the one hand, we propose an adaptive instance selection strategy, which selects confident instances based on the model's current status, thereby mitigating the uncertainty of weakly labeled data and subsequently improving the model's performance. On the other hand, we design a lightweight multi-level temporal correlation attention module and an hourglass-shaped fully connected layer to construct the model, which reduces the number of model parameters to only 0.56% of that of existing methods (e.g., RTFM). Our extensive experiments on two public datasets, UCF-Crime and ShanghaiTech, show that our model achieves comparable or even superior AUC scores compared to state-of-the-art methods, with a significantly reduced number of model parameters.

Paper Structure

This paper contains 19 sections, 9 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: The framework of our method, Light-WVAD. Our model is based on the multi-instance learning (MIL) framework. Each video is divided into 32 consecutive clips (or instances), which are grouped into a positive instance bag (for abnormal videos) or a negative instance bag (for normal videos). Video features are extracted by I3D. A Multi-level Temporal correlation Attention (MTA) module is designed to capture time-related information, which is then fed into an Hourglass-shaped Fully Connected layer (HFC) to calculate the score of each instance. The top-$K$ reliable instances are selected based on an Adaptive Instance Selection (AIS) strategy for subsequent loss calculation.
  • Figure 2: The structure of the multi-level temporal correlation attention (MTA) module. Here, each video is divided into $T$ (32 in this paper) clips, each of which corresponds to an instance.
  • Figure 3: The structures of (a) the traditional fully connected layer (FC), and (b) our hourglass-shaped fully connected layer (HFC).
  • Figure 4: The workflow of our adaptive instance selection (AIS) strategy on a pair of positive and negative bags. It consists of three steps. Here, each red or blue square is an instance. Top-$K$ instances are selected from both the positive and negative bags.
  • Figure 5: The loss curves during model training when using (a) the sparsity loss and (b) our antagonistic loss.
  • ...and 3 more figures
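
The Figure 1 and Figure 4 captions describe splitting each video into 32 clips, scoring every clip, and keeping the top-$K$ instances from a positive and a negative bag. The following is a minimal Python sketch of that general MIL pattern only. It uses a fixed $K$ and a generic ranking hinge as stand-ins, because the adaptive rule of AIS and the paper's actual loss are not given in this summary. All function names, the `margin` parameter, and the example scores are hypothetical.

```python
from typing import List, Sequence, Tuple

NUM_CLIPS = 32  # number of clips (instances) per video, from the Figure 1 caption


def clip_boundaries(num_frames: int, num_clips: int = NUM_CLIPS) -> List[Tuple[int, int]]:
    """Split frame indices [0, num_frames) into consecutive, near-equal clips.

    Returns half-open (start, end) ranges. Hypothetical helper, not from the paper.
    """
    if num_frames < num_clips:
        raise ValueError("need at least one frame per clip")
    return [
        (i * num_frames // num_clips, (i + 1) * num_frames // num_clips)
        for i in range(num_clips)
    ]


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring instances in a bag, best first."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the bag size")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_mean(scores: Sequence[float], k: int) -> float:
    """Mean anomaly score over the k highest-scoring instances."""
    return sum(scores[i] for i in top_k_indices(scores, k)) / k


def ranking_hinge(pos_scores: Sequence[float], neg_scores: Sequence[float],
                  k: int, margin: float = 1.0) -> float:
    """Generic MIL ranking hinge (an assumption, NOT the paper's loss).

    Encourages the top-k instances of the positive (abnormal) bag to score
    higher than the top-k instances of the negative (normal) bag.
    """
    return max(0.0, margin - top_k_mean(pos_scores, k) + top_k_mean(neg_scores, k))


if __name__ == "__main__":
    clips = clip_boundaries(320)
    print(len(clips), clips[0], clips[-1])

    pos_bag = [0.1, 0.9, 0.5, 0.7]  # made-up scores for an abnormal video
    neg_bag = [0.2, 0.1, 0.3, 0.0]  # made-up scores for a normal video
    print(top_k_indices(pos_bag, 2), ranking_hinge(pos_bag, neg_bag, k=2))
```

In this sketch $K$ is passed in by the caller. An adaptive strategy such as AIS would instead choose the selected instances from the model's current state during training, which is left out here.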