
Human-Centric Video Anomaly Detection Through Spatio-Temporal Pose Tokenization and Transformer

Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi

TL;DR

SPARTA is a lightweight, transformer-based approach to human-centric VAD that applies Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization within a Unified Encoder Twin Decoders (UETD) architecture. The twin decoders (CTD and FTD) provide complementary current and future anomaly cues, which are fused into a single frame-level score. Self-supervised training on normal data with pose-based inputs yields state-of-the-art AUC-ROC and low EER across four datasets, with strong generalization and privacy advantages over pixel-based methods. The results demonstrate the value of explicit pose tokenization and of combining the two decoders for detecting varied, open-set anomalies in real-world videos.

Abstract

Video Anomaly Detection (VAD) presents a significant challenge in computer vision, particularly due to the unpredictable and infrequent nature of anomalous events, coupled with the diverse and dynamic environments in which they occur. Human-centric VAD, a specialized area within this domain, faces additional complexities, including variations in human behavior, potential biases in data, and substantial privacy concerns related to human subjects. These issues complicate the development of models that are both robust and generalizable. To address these challenges, recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce SPARTA, a novel transformer-based architecture designed specifically for human-centric pose-based VAD. SPARTA introduces an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method that produces an enriched representation of human motion over time. This approach ensures that the transformer's attention mechanism captures both spatial and temporal patterns simultaneously, rather than focusing on only one aspect. The addition of the relative pose further emphasizes subtle deviations from normal human movements. The architecture's core, a novel Unified Encoder Twin Decoders (UETD) transformer, significantly improves the detection of anomalous behaviors in video data. Extensive evaluations across multiple benchmark datasets demonstrate that SPARTA consistently outperforms existing methods, establishing a new state-of-the-art in pose-based VAD.
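As a rough illustration of the ST-PRP idea described in the abstract, the sketch below turns a window of poses into tokens that each carry one keypoint's trajectory over time, alongside a relative-pose counterpart. The token layout, the function name `st_prp_tokens`, and the choice of the per-frame centroid as the reference point for relative pose are all assumptions made for illustration; the source does not specify them here.

```python
from typing import List, Tuple

Pose = List[Tuple[float, float]]  # one frame: k keypoints as (x, y)


def st_prp_tokens(window: List[Pose]) -> List[List[float]]:
    """Build hypothetical spatio-temporal tokens from a window of beta poses.

    Assumed layout (not confirmed by the source): four tokens per keypoint,
    each of length beta, holding x(t, k), y(t, k) and the relative
    coordinates dx(t, k), dy(t, k) measured from the per-frame centroid.
    """
    beta = len(window)       # input window size
    k = len(window[0])       # number of keypoints
    # Per-frame centroid, used here as the assumed reference for relative pose.
    centroids = [
        (sum(x for x, _ in pose) / k, sum(y for _, y in pose) / k)
        for pose in window
    ]
    tokens: List[List[float]] = []
    for j in range(k):
        xs = [window[t][j][0] for t in range(beta)]
        ys = [window[t][j][1] for t in range(beta)]
        dxs = [xs[t] - centroids[t][0] for t in range(beta)]
        dys = [ys[t] - centroids[t][1] for t in range(beta)]
        # Each token spans the whole window, so one token mixes the spatial
        # identity of a keypoint with its temporal evolution.
        tokens.extend([xs, ys, dxs, dys])
    return tokens
```

Under this layout, a window of `beta` frames with `k` keypoints yields `4 * k` tokens of length `beta`, so attention across tokens relates keypoints to each other while each token already encodes motion over time.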

Paper Structure

This paper contains 20 sections, 7 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: A conceptual overview of SPARTA. SPARTA assigns higher scores to anomalous pose sequences. The final frame score is the maximum score of all individuals in the scene.
  • Figure 2: SPARTA architecture. ST-PRP tokenization reorders and prepares input pose sequences for the UETD transformer core, which consists of a unified pose transformer encoder and twin decoders for CTD and FTD. The MSE losses of the CTD and FTD branches are used to calculate the Current Score (CS) and Future Score (FS), respectively. The Hybrid Score (HS) is the weighted sum of CS and FS with constant multipliers $a$ and $b$, both set to $0.5$, so HS is the average of the two scores. Red and blue represent the SPARTA-F and SPARTA-C data flows, respectively.
  • Figure 3: SPARTA Spatio-Temporal Pose and Relative Pose (ST-PRP) Tokenization Schema. $k$ is the number of keypoints, $\beta$ is the input window size, $\Delta$ denotes relative coordinates, and $x(t, k)$ and $y(t, k)$ are the coordinates of the $k^{th}$ keypoint at time step $t$.
  • Figure 4: Output anomaly scores of SPARTA-H for each frame of clip $01\_0025$ from the SHT dataset liu2018future. The red area on the plot indicates the ground truth anomalous frames. In this clip, the anomalous behavior is a person riding a bike on the sidewalk, shown by the red rectangle.
  • Figure 5: Output anomaly scores of SPARTA-H for each frame of clip $04\_093\_1$ from the CHAD dataset danesh2023chad. The red area on the plot indicates the ground truth anomalous frames. In this clip, the anomalous behavior is two people fighting, shown by the red rectangle.
  • ...and 2 more figures
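
The scoring described in the Figure 1 and Figure 2 captions can be sketched in a few lines: an MSE per decoder branch gives CS and FS, a weighted sum with $a = b = 0.5$ gives HS, and the frame score is the maximum over all individuals in the scene. The function names are hypothetical, and treating CS as the error against the observed input window and FS as the error against the observed future window is an assumption about what each branch is compared to. Returning `0.0` for a frame with no detected person is also an assumption, not something the captions state.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length flat sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def hybrid_score(cs: float, fs: float, a: float = 0.5, b: float = 0.5) -> float:
    """Hybrid Score: HS = a * CS + b * FS, with a = b = 0.5 per Figure 2."""
    return a * cs + b * fs


def frame_score(person_scores: Sequence[float]) -> float:
    """Frame-level score: the maximum score over individuals (Figure 1).

    The 0.0 fallback for an empty frame is an assumption for this sketch.
    """
    return max(person_scores, default=0.0)


def score_person(ctd_out, observed_now, ftd_out, observed_future) -> float:
    """Per-person hybrid score from the two decoder outputs (assumed inputs)."""
    cs = mse(ctd_out, observed_now)      # Current Score from the CTD branch
    fs = mse(ftd_out, observed_future)   # Future Score from the FTD branch
    return hybrid_score(cs, fs)
```

Taking the maximum rather than the mean over people means a single anomalous individual is enough to raise the score of the whole frame, which matches the behavior described in the Figure 1 caption.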