Contracting Skeletal Kinematics for Human-Related Video Anomaly Detection

Alessandro Flaborea, Guido D'Amely, Stefano D'Arrigo, Marco Aurelio Sterpa, Alessio Sampieri, Fabio Galasso

TL;DR

COSKAD introduces a lightweight, end-to-end skeleton-based anomaly detection framework that contracts normal skeletal motions toward a center on multiple latent manifolds. By combining a space-time separable GCN encoder with a projector and a data-driven center update, it explores Euclidean, spherical, and hyperbolic latent spaces to capture normality and detect anomalies under open-set conditions. The approach achieves state-of-the-art results on human-related versions of UBnormal, ShanghaiTech Campus, and CUHK Avenue, while offering privacy advantages and reduced computational costs relative to appearance-based video methods. The work also presents a human-related UBnormal variant and comprehensive ablations that highlight the benefits of manifold choices, projection non-linearity, and dynamic center updates for reliable anomaly scoring.

Abstract

Detecting anomalies in human behavior is paramount to the timely recognition of dangerous situations, such as street fights or falls among the elderly. However, anomaly detection is complex because anomalous events are rare and because it is an open-set recognition task, i.e., what is anomalous at inference has not been observed during training. We propose COSKAD, a novel model that encodes skeletal human motion with a graph convolutional network and learns to COntract SKeletal kinematic embeddings onto a latent hypersphere of minimum volume for Video Anomaly Detection. We propose three latent spaces: the commonly adopted Euclidean space and the novel spherical and hyperbolic spaces. All variants outperform the state-of-the-art on the most recent UBnormal dataset, for which we contribute a human-related version with annotated skeletons. COSKAD sets a new state-of-the-art on the human-related versions of ShanghaiTech Campus and CUHK Avenue, with performance comparable to video-based methods. Source code and dataset will be released upon acceptance.
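
The contraction idea in the abstract can be illustrated with a small sketch in the Euclidean case. This is not the authors' implementation: it assumes a one-class objective that minimizes the mean squared distance of the embeddings of normal motions to a common center, with the center taken as the mean of those embeddings. The function names `mean_center` and `contraction_loss` are hypothetical, and the paper's exact loss and center-update rule are not given in this summary.

```python
def mean_center(embeddings):
    """Return the component-wise mean of a list of equal-length vectors.

    Assumption: the common center is the mean embedding of normal samples.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def contraction_loss(embeddings, center):
    """Mean squared Euclidean distance of the embeddings to the center.

    Minimizing this value pulls all embeddings toward the center, which
    shrinks the hypersphere that encloses them.
    """
    total = 0.0
    for e in embeddings:
        total += sum((ei - ci) ** 2 for ei, ci in zip(e, center))
    return total / len(embeddings)


# Toy embeddings standing in for encoded normal motion sequences.
normal_embeddings = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
center = mean_center(normal_embeddings)
loss = contraction_loss(normal_embeddings, center)
```

In a trained model the embeddings would come from the encoder and projector, and the loss would be minimized by gradient descent; here the values are fixed only to show the quantity being reduced.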

Paper Structure (36 sections, 8 equations, 6 figures, 5 tables)

This paper contains 36 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Anomaly score provided by COSKAD on a clip from the UBnormal dataset. COSKAD correctly classifies the motion of the two staggering characters (red skeletons in the upper-right picture) in the last part of the clip as anomalous.
  • Figure 2: The overall architecture of COSKAD. The model combines an STS-GCN-based encoder [sofianos21] (light green and light blue blocks) with a projector module (yellow block). After projection, the latent representation (red vector in the figure) is embedded into the latent space. We propose and evaluate three variants of the latent space: Euclidean $\mathbb{R}^n$, spherical $\mathbb{S}^n$, and hyperbolic, modeled with the Poincaré ball $\mathbb{D}^n$. During training, the embeddings are constrained to accumulate in a narrow region of the chosen manifold by reducing the distance between each motion embedding and the common center. During inference, sequences mapped farther from the center are interpreted as anomalous.
  • Figure 3: Visualization of the UBnormal test set's latent vectors embedded in three different manifolds: (a) Euclidean, (b) spherical, and (c) hyperbolic. We retain the three dimensions with the highest variance and color-code the points according to their distance from the center, from blue (closest) to red (farthest). Distance is measured as the $L^2$ norm in the Euclidean case, the cosine distance on $\mathbb{S}^n$, and the Poincaré distance for the hyperbolic embeddings. In the hyperbolic case, we highlight in green the hyperboloid onto which the embeddings are projected for better visualization.
  • Figure 4: Examples of extracted poses in HR-UBnormal. The poses are correctly detected even in challenging conditions, e.g., different scales or unusual poses. See the section "Sample of misestimated human poses" for discussion.
  • Figure 5: Examples of misestimations of the pose extractor in HR-UBnormal. Panel (a) shows a pose that is not present in the scene, panel (b) shows a pose that is not detected, and panel (c) shows a noisy pose estimation due to the scale of the subject and its partial occlusion. See the section "Sample of misestimated human poses" for discussion.
  • ...and 1 more figure
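
The three distances named in the Figure 3 caption can be written out as a short sketch. It assumes the standard definitions: the $L^2$ norm of the difference for the Euclidean case, one minus cosine similarity for the spherical case, and the geodesic distance on the unit Poincaré ball with curvature $-1$ for the hyperbolic case. The function names and the `anomaly_score` dispatcher are hypothetical, and the curvature used in the paper is not stated in this summary.

```python
import math


def euclidean_distance(z, c):
    """L2 norm of the difference between embedding z and center c."""
    return math.sqrt(sum((zi - ci) ** 2 for zi, ci in zip(z, c)))


def cosine_distance(z, c):
    """One minus cosine similarity; both vectors must be non-zero."""
    dot = sum(zi * ci for zi, ci in zip(z, c))
    norm_z = math.sqrt(sum(zi * zi for zi in z))
    norm_c = math.sqrt(sum(ci * ci for ci in c))
    return 1.0 - dot / (norm_z * norm_c)


def poincare_distance(z, c):
    """Geodesic distance on the unit Poincare ball (assumed curvature -1).

    d(z, c) = arcosh(1 + 2 * |z - c|^2 / ((1 - |z|^2) * (1 - |c|^2)))
    """
    sq_z = sum(zi * zi for zi in z)
    sq_c = sum(ci * ci for ci in c)
    if sq_z >= 1.0 or sq_c >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((zi - ci) ** 2 for zi, ci in zip(z, c))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_z) * (1.0 - sq_c)))


def anomaly_score(z, c, manifold):
    """Distance of z from the center c on the chosen manifold.

    A larger distance is read as a more anomalous sequence.
    """
    distances = {
        "euclidean": euclidean_distance,
        "spherical": cosine_distance,
        "hyperbolic": poincare_distance,
    }
    return distances[manifold](z, c)
```

For example, `anomaly_score([0.5, 0.0], [0.0, 0.0], "hyperbolic")` evaluates to `math.log(3)`, since the distance from the origin of the ball to a point of norm `r` is `2 * atanh(r)`.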