Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection
Hao Shen, Lu Shi, Wanru Xu, Yigang Cen, Linna Zhang, Gaoyun An
TL;DR
Video Anomaly Detection (VAD) often relies on pixel-level frame generation, which can miss high-level spatio-temporal context. The authors propose PSTRP, a self-supervised, two-stream Vision Transformer that learns appearance and motion by solving an inter-patch spatio-temporal relation prediction task, augmented with a distance-constraint module. Object ROIs are used to form the spatio-temporal cubes (STCs) on which the model learns. PSTRP predicts the correct spatial and temporal patch orders through output matrices $M_S$ and $M_T$, trained with order losses $L_S$ and $L_T$, and enforces inter-patch relations through the distances $D_{Canberra}$ and $D_{Cosine}$, with corresponding losses $L_{Can}$ and $L_{Cos}$. Anomaly scoring computes frame-level regularities $R$ from the diagonals of $M_S$ and $M_T$, and the final score is $S = 1 - R$. Experiments on three benchmarks show competitive performance, including state-of-the-art results on Avenue, and the method outperforms reconstruction/prediction-based and other self-supervised methods.
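The scoring rule $S = 1 - R$ can be illustrated with a small sketch. The summary only states that $R$ is derived from the diagonals of $M_S$ and $M_T$; the specific choices below (taking the minimum diagonal entry of each matrix and averaging the two streams with a hypothetical `weight` parameter) are assumptions made for illustration, not the authors' confirmed procedure. The function names are likewise hypothetical.

```python
def diagonal_regularity(matrix):
    """Return the smallest diagonal entry of a square order-prediction matrix.

    Assumption: entry [i][i] is the predicted probability that patch i sits
    in its correct position, so a low value signals an irregular patch.
    """
    return min(matrix[i][i] for i in range(len(matrix)))


def anomaly_score(m_s, m_t, weight=0.5):
    """Combine spatial and temporal regularities and return S = 1 - R.

    Assumption: the two streams are fused by a weighted average; `weight`
    is a hypothetical parameter, not one taken from the paper.
    """
    r = weight * diagonal_regularity(m_s) + (1.0 - weight) * diagonal_regularity(m_t)
    return 1.0 - r


# Toy 2x2 matrices: each row is a distribution over candidate positions.
m_s = [[0.9, 0.1],
       [0.2, 0.8]]
m_t = [[0.6, 0.4],
       [0.3, 0.7]]

score = anomaly_score(m_s, m_t)
print(round(score, 3))
```

Under these assumptions, a model that confidently places every patch in its correct position yields diagonals near 1 and a score near 0, while uncertain or wrong order predictions push the score towards 1.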
Abstract
Video Anomaly Detection (VAD), which aims to identify abnormalities within a specific context and timeframe, is crucial for intelligent Video Surveillance Systems. While recent deep learning-based VAD models have shown promising results by generating high-resolution frames, they often fail to preserve detailed spatial and temporal coherence in video frames. To tackle this issue, we propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task. Specifically, we introduce a two-branch vision transformer network designed to capture deep visual features of video frames, with the spatial and temporal branches modeling appearance and motion patterns, respectively. The inter-patch relationship in each dimension is decoupled into inter-patch similarity and the order information of each patch. To mitigate memory consumption, we convert the order information prediction task into a multi-label learning problem, and the inter-patch similarity prediction task into a distance matrix regression problem. Comprehensive experiments demonstrate the effectiveness of our method, which surpasses pixel-generation-based methods by a significant margin across three public benchmarks. Additionally, our approach outperforms other self-supervised learning-based methods.
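To make the two decoupled targets concrete, the following minimal sketch builds (a) a multi-label order target, in which a permutation of $n$ patches is encoded as an $n \times n$ binary matrix rather than as one of $n!$ classes, and (b) pairwise Canberra and cosine distance matrices of the kind a distance matrix regression could use as targets. The Canberra and cosine definitions are the standard ones; what the distances are computed over (here, toy patch vectors), the zero-denominator conventions, and all function names are assumptions for illustration rather than details taken from the paper.

```python
import math


def order_target(permutation):
    """Encode a patch permutation as an n x n binary matrix.

    Row i has a single 1 in column permutation[i], so the target has n*n
    entries instead of one class per possible ordering (n! classes).
    """
    n = len(permutation)
    return [[1 if permutation[i] == j else 0 for j in range(n)] for i in range(n)]


def canberra(u, v):
    """Canberra distance: sum of |a - b| / (|a| + |b|), skipping 0/0 terms."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumed convention for zero vectors
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(patches, dist):
    """Pairwise distance matrix between patch vectors under `dist`."""
    return [[dist(p, q) for q in patches] for p in patches]


# Toy patch vectors standing in for patch features (hypothetical data).
patches = [[1.0, 2.0], [1.0, 4.0], [2.0, 0.0]]

target = order_target([2, 0, 1])
d_canberra = distance_matrix(patches, canberra)
d_cosine = distance_matrix(patches, cosine_distance)
```

Both distance matrices are symmetric with a zero diagonal, so a regression head only needs to match $n \times n$ real values per cube, which is in line with the memory argument made in the abstract.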
