Temporal Divide-and-Conquer Anomaly Actions Localization in Semi-Supervised Videos with Hierarchical Transformer

Nada Osman; Marwan Torki

TL;DR

This work proposes a hierarchical transformer model that evaluates the significance of observed actions in anomalous videos using a divide-and-conquer strategy along the temporal axis, and demonstrates its ability to interpret the observed actions within videos and localize the anomalous ones.

Abstract

Anomaly action detection and localization play an essential role in security and advanced surveillance systems. However, due to the tremendous amount of surveillance video, most of the available data for the task is unlabeled or semi-labeled: the video class is known, but the location of the anomaly event is not. In this work, we target anomaly localization in semi-supervised videos. While the mainstream direction in addressing this task focuses on segment-level multi-instance learning and the generation of pseudo labels, we explore a promising yet underexplored direction: learning the temporal relations within videos in order to locate anomaly events. To this end, we propose a hierarchical transformer model designed to evaluate the significance of observed actions in anomalous videos with a divide-and-conquer strategy along the temporal axis. Our approach segments a parent video hierarchically into multiple temporal children instances and measures the influence of the children nodes in classifying the abnormality of the parent video. Evaluating our model on two well-known anomaly detection datasets, UCF-Crime and ShanghaiTech, demonstrates its ability to interpret the observed actions within videos and localize the anomalous ones. Our proposed approach outperforms previous works relying on segment-level multiple-instance learning while reaching promising performance compared to the more recent pseudo-labeling-based approaches.
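
The hierarchical split described in the abstract can be illustrated with a minimal sketch. It assumes that a video of N segments is divided into equal, contiguous parts, so that level k holds 2^(k-1) children. The function name `split_level` and the equal-size split are illustrative assumptions, not details confirmed by the abstract.

```python
def split_level(num_segments: int, level: int) -> list[range]:
    """Return the segment index ranges of the sub-videos at one level.

    Assumption (not from the paper): each level splits the parent into
    equal, contiguous parts, giving 2**(level - 1) children at `level`.
    Level 1 is the whole video.
    """
    if level < 1:
        raise ValueError("level must be >= 1")
    parts = 2 ** (level - 1)
    if num_segments % parts != 0:
        raise ValueError("num_segments must be divisible by 2**(level - 1)")
    size = num_segments // parts
    # Child j covers segments [j * size, (j + 1) * size).
    return [range(j * size, (j + 1) * size) for j in range(parts)]


# Hypothetical example: a video of 32 segments viewed at three levels.
hierarchy = {k: split_level(32, k) for k in (1, 2, 3)}
```

Under these assumptions, a 32-segment video has one node at level 1, two children of 16 segments at level 2, and four children of 8 segments at level 3.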

Paper Structure (18 sections, 7 equations, 5 figures, 5 tables)

Introduction
Related Work
Anomaly Detection
Class Activation Maps Learning
Temporal Hierarchical Modeling
Temporal Divide-and-Conquer Approach
Double Scale Features Extractor
Hierarchical Transformer Layers
Prediction Head
Localization Approach
Experimental Results
Dataset
Evaluation Metrics
Implementation details
Results
...and 3 more sections

Figures (5)

Figure 1: Each video is split into $N$ segments. A normal video ($y_v=0$) contains only normal segments ($y_s^i=0, \forall i\in[1:N]$), while an anomaly video ($y_v=1$) contains at least one anomaly segment ($\exists i\in[1:N]: y_s^i=1$). Our approach employs a hierarchical transformer model to classify the abnormality of the whole video, in addition to producing abnormality scores for the individual segments. This approach differs from previous works that overlook the context of the entire video and classify individual segments independently.
Figure 2: Our divide-and-conquer transformer-based model takes the segmented video as input, where the video is divided into $N$ segments. Features are extracted from these segments by our Double Scale Features Extractor (DS-$\Phi$) module and then passed to the hierarchical transformer layers for classification. At the first level ($TL_1^1$), the model generates the video classification ($y_1^1 = y_v$), and at each subsequent level ($T_k$), it produces the sub-video classifications $y_k^j, \quad \forall j \in [1, 2, 3, \dots, 2^{k-1}]$.
Figure 3: Visualization of our localization approach. Assuming the localization is conducted at level $k$, $w_k$ is obtained from the attention weights of the class query inside our self-attention layers, $h_k$ is the average-pooled encodings produced by the transformer, and $y_k$ is the stacked sub-predictions at level $k$. The estimated abnormality is computed as in Equation (\ref{eq:seg_cls}).
Figure 4: Qualitative example of accurate anomaly localization from UCF-Crime. Anomaly score estimations are provided for the anomaly action "Assault", which begins at segment $S_5$ and continues until the final segment $S_{32}$. Here, $e_i$ represents the score estimated in (\ref{eq:seg_cls}), $a_i$ denotes the segment's activation as defined in (\ref{eq:ai}), and $t_i$ refers to the attention weights described in (\ref{eq:ti}).
Figure 5: Qualitative example of a slightly misled anomaly localization from UCF-Crime. The heat maps illustrate anomaly score estimations for the "Arrest" event, which starts at segment $S_{17}$ and extends through segment $S_{21}$. The illustrated heat maps are $p_i$, $t_i$, $a_i$, the aggregated estimation $e_i$, and the ground truth labeling of the segments.
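
The labeling convention in Figure 1 (a video is anomalous exactly when at least one of its segments is anomalous) can be written as a short sketch. The helper name `video_label` is hypothetical, and the segment labels below are built by hand from the "Assault" span stated in the Figure 4 caption ($S_5$ through $S_{32}$). In the semi-supervised setting only the video-level label is available during training, so segment labels like these serve for evaluation only.

```python
def video_label(segment_labels: list[int]) -> int:
    """Video-level label y_v implied by the segment labels y_s^i.

    y_v = 1 if any segment is anomalous, otherwise y_v = 0.
    """
    return int(any(label == 1 for label in segment_labels))


# Figure 4 caption: "Assault" spans segments S_5..S_32 of a 32-segment video,
# so the first four segments are normal and the remaining 28 are anomalous.
assault_segments = [0] * 4 + [1] * 28
normal_segments = [0] * 32

assault_video = video_label(assault_segments)  # anomaly video
normal_video = video_label(normal_segments)    # normal video
```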
