Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection

Hao Shen, Lu Shi, Wanru Xu, Yigang Cen, Linna Zhang, Gaoyun An

TL;DR

Video Anomaly Detection often relies on pixel-level frame generation, which can miss high-level spatio-temporal context. The authors propose PSTRP, a self-supervised, two-stream Vision Transformer that learns appearance and motion by solving an inter-patch spatio-temporal relation prediction task, augmented with a distance-constraint module; object ROIs are used to form spatio-temporal cubes (STCs) for learning. PSTRP frames the task as predicting the correct spatial and temporal patch orders via output matrices $M_S$ and $M_T$, while enforcing inter-patch relations through $D_{Canberra}$ and $D_{Cosine}$ with losses $L_{Can}$ and $L_{Cos}$, combined with order losses $L_S$ and $L_T$. Anomaly scoring computes frame-level regularities $R$ from the diagonals of $M_S$ and $M_T$, with the final score $S = 1 - R$. Experiments on three benchmarks show competitive performance, including state-of-the-art results on Avenue, and validate the approach's effectiveness over reconstruction/prediction-based and other self-supervised methods, suggesting strong potential for robust, scalable VAD in real-world surveillance settings.
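The scoring rule $S = 1 - R$ can be illustrated with a minimal sketch. The summary only states that $R$ is derived from the diagonals of $M_S$ and $M_T$; how the diagonal entries are aggregated and how the two streams are fused is not given here, so the mean-based aggregation and the equal weighting below are assumptions, and the function names are hypothetical.

```python
from statistics import mean


def diagonal(matrix):
    """Return the main diagonal of a square matrix given as a list of rows."""
    return [matrix[i][i] for i in range(len(matrix))]


def regularity(m_s, m_t):
    """Regularity R from the diagonals of the spatial and temporal matrices.

    ASSUMPTION: each stream's regularity is the mean of its diagonal,
    and the two streams are averaged with equal weight.
    """
    r_s = mean(diagonal(m_s))
    r_t = mean(diagonal(m_t))
    return 0.5 * (r_s + r_t)


def anomaly_score(m_s, m_t):
    """Anomaly score S = 1 - R, as stated in the summary."""
    return 1.0 - regularity(m_s, m_t)


# A confident, correct order prediction puts all mass on the diagonal.
confident = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]]

# An uncertain prediction spreads mass evenly over all positions.
uncertain = [[1.0 / 3.0] * 3 for _ in range(3)]

low = anomaly_score(confident, confident)
high = anomaly_score(uncertain, uncertain)
```

Under these assumptions, a model that places patches confidently yields a score near 0, while one that cannot recover the order yields a higher score.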

Abstract

Video Anomaly Detection (VAD), aiming to identify abnormalities within a specific context and timeframe, is crucial for intelligent Video Surveillance Systems. While recent deep learning-based VAD models have shown promising results by generating high-resolution frames, they often lack competence in preserving detailed spatial and temporal coherence in video frames. To tackle this issue, we propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task. Specifically, we introduce a two-branch vision transformer network designed to capture deep visual features of video frames, addressing spatial and temporal dimensions responsible for modeling appearance and motion patterns, respectively. The inter-patch relationship in each dimension is decoupled into inter-patch similarity and the order information of each patch. To mitigate memory consumption, we convert the order information prediction task into a multi-label learning problem, and the inter-patch similarity prediction task into a distance matrix regression problem. Comprehensive experiments demonstrate the effectiveness of our method, surpassing pixel-generation-based methods by a significant margin across three public benchmarks. Additionally, our approach outperforms other self-supervised learning-based methods.
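The two training targets mentioned in the abstract can be sketched as follows: an order target matrix for the multi-label formulation, and pairwise distance matrices for the regression formulation. This is an illustration under stated assumptions, not the authors' implementation: the one-hot row layout of the order target, the zero-denominator convention in the Canberra distance, and all function names are hypothetical. The Canberra and cosine distances themselves follow their standard definitions.

```python
import math
import random


def shuffled_order(num_patches, seed=0):
    """Return a random permutation; order[i] is the original position
    of the patch placed at input slot i."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    return order


def order_target(order):
    """Build an N x N target with a 1 at (slot, original position).

    ASSUMPTION: each row is treated as a label vector for one patch.
    """
    n = len(order)
    return [[1.0 if col == order[row] else 0.0 for col in range(n)]
            for row in range(n)]


def canberra(u, v):
    """Canberra distance: sum of |a - b| / (|a| + |b|) over coordinates.
    Coordinates where both values are zero contribute nothing."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0.0:
            total += abs(a - b) / denom
    return total


def cosine_distance(u, v):
    """Cosine distance: 1 - cosine similarity (1.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(patches, dist):
    """Pairwise distance matrix over a list of patch feature vectors."""
    return [[dist(p, q) for q in patches] for p in patches]


# Toy patch features (hypothetical values) and their regression targets.
patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
d_canberra = distance_matrix(patches, canberra)
d_cosine = distance_matrix(patches, cosine_distance)
target = order_target([2, 0, 1])
```

If patches are fed in their original order, `order_target` returns the identity matrix, which is presumably why the diagonals of $M_S$ and $M_T$ are the entries read out for scoring.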

Paper Structure

This paper contains 16 sections, 11 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: The extraction process of an STC and the division of the STC into spatial and temporal cubes.
  • Figure 2: The framework of PSTRP. An STC is divided into small patches spatially and temporally. After the embedding layer, randomized positional encodings are attached to the patch embeddings (the correct position order is indicated by the colors, from darkest to lightest). One vision transformer (the upper transformer encoder module) is dedicated to spatial patch order prediction and appearance feature learning, while the other (the lower transformer encoder module) handles temporal patch order prediction and motion feature learning. The model's predictions indicate the anomaly scores.
  • Figure 3: Relation matrix.
  • Figure 4: Anomaly scores, which reflect the reconstruction error, on the Ped2, Avenue and SHTech datasets. The orange regions mark the time intervals in which abnormal events occur in the video frames. The anomaly score (red curve) rises sharply, along with the reconstruction error, when the abnormal frames begin.