Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Sijie Yan, Yuanjun Xiong, Dahua Lin

TL;DR

This work tackles skeleton-based action recognition by modeling human joints as a spatial-temporal graph and applying learned graph convolutions. The Spatial-Temporal Graph Convolutional Network (ST-GCN) builds a graph with intra-frame joint connections and inter-frame temporal links, and uses partitioned, learnable convolutional kernels with optional edge weighting to capture local structure and temporal dynamics. In ablations and large-scale evaluations on Kinetics and NTU-RGB+D, ST-GCN achieves state-of-the-art results among skeleton-based approaches and provides information complementary to RGB-based methods. The approach adapts to datasets with different joint configurations and shows strong potential for integration with multi-modal video representations.
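The graph construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 5-joint skeleton, the `BONES` list, the flat node index `t * num_joints + v`, and the function name `build_st_adjacency` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy skeleton: pairs of joint indices connected by a bone.
BONES = [(0, 1), (1, 2), (2, 3), (0, 4)]


def build_st_adjacency(num_frames, num_joints, bones):
    """Build a symmetric adjacency matrix over all (frame, joint) nodes.

    Node (t, v) is stored at flat index t * num_joints + v (an assumed layout).
    """
    n = num_frames * num_joints
    adj = np.zeros((n, n), dtype=np.int8)
    for t in range(num_frames):
        base = t * num_joints
        # Intra-frame edges: natural joint connections within frame t.
        for i, j in bones:
            adj[base + i, base + j] = 1
            adj[base + j, base + i] = 1
        # Inter-frame edges: the same joint in consecutive frames.
        if t + 1 < num_frames:
            nxt = base + num_joints
            for v in range(num_joints):
                adj[base + v, nxt + v] = 1
                adj[nxt + v, base + v] = 1
    return adj


adj = build_st_adjacency(num_frames=3, num_joints=5, bones=BONES)
```

With 3 frames and 5 joints this yields 12 intra-frame and 10 inter-frame undirected edges. A dense matrix is used only for readability; a practical model would more likely keep the spatial adjacency per frame and handle the temporal links with an ordinary convolution along the time axis.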

Abstract

Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, resulting in limited expressive power and difficulty generalizing. In this work, we propose a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. This formulation leads not only to greater expressive power but also to stronger generalization capability. On two large datasets, Kinetics and NTU-RGB+D, it achieves substantial improvements over mainstream methods.

Paper Structure

This paper contains 31 sections, 10 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The spatial temporal graph of a skeleton sequence on which the proposed ST-GCN operates. Blue dots denote the body joints. The intra-body edges between body joints are defined based on the natural connections in human bodies. The inter-frame edges connect the same joints between consecutive frames. Joint coordinates are used as inputs to the ST-GCN.
  • Figure 2: We perform pose estimation on videos and construct a spatial temporal graph on the skeleton sequences. Multiple layers of spatial-temporal graph convolution (ST-GCN) are applied and gradually generate higher-level feature maps on the graph. The resulting features are then classified by a standard Softmax classifier into the corresponding action category.
  • Figure 3: The proposed partitioning strategies for constructing convolution operations. From left to right: (a) An example frame of an input skeleton. Body joints are drawn as blue dots. The receptive fields of a filter with $D=1$ are drawn as red dashed circles. (b) Uni-labeling partitioning, where all nodes in a neighborhood have the same label (green). (c) Distance partitioning. The two subsets are the root node itself, with distance $0$ (green), and the other neighboring nodes, with distance $1$ (blue). (d) Spatial configuration partitioning. Nodes are labeled according to their distance to the skeleton gravity center (black cross) relative to that of the root node (green): centripetal nodes are closer to the center than the root (blue), while centrifugal nodes are farther from it (yellow).
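The three partitioning strategies in Figure 3 can be sketched as a labeling function over a root joint and its 1-hop neighbors ($D=1$). This is a rough illustration rather than the paper's code. The toy skeleton, the 2D coordinates, and the helper names are hypothetical. Taking the gravity center as the mean of the joint coordinates, and treating a neighbor at exactly the root's distance as centrifugal, are both assumptions.

```python
import numpy as np

# Hypothetical toy skeleton: neck(0), shoulder(1), elbow(2), hand(3), hip(4).
BONES = [(0, 1), (1, 2), (2, 3), (0, 4)]
# Hypothetical 2D joint coordinates for a single frame.
COORDS = np.array([[0.0, 2.0], [1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [0.0, 0.0]])


def one_hop_neighbors(root, bones):
    """Joints directly connected to `root` by a bone, in sorted order."""
    return sorted({j for i, j in bones if i == root}
                  | {i for i, j in bones if j == root})


def partition_labels(root, neighbors, coords, strategy):
    """Return {node: subset label} for the root and its 1-hop neighbors."""
    nodes = [root] + list(neighbors)
    if strategy == "uni":
        # (b) Uni-labeling: the whole neighborhood shares one label.
        return {v: 0 for v in nodes}
    if strategy == "distance":
        # (c) Distance: root (distance 0) vs. neighbors (distance 1).
        return {v: (0 if v == root else 1) for v in nodes}
    if strategy == "spatial":
        # (d) Spatial configuration: compare each neighbor's distance to the
        # gravity center (assumed here: mean of all joints) with the root's.
        center = coords.mean(axis=0)
        r = np.linalg.norm(coords - center, axis=1)
        labels = {root: 0}
        for v in neighbors:
            # 1 = centripetal (closer than root), 2 = centrifugal (otherwise).
            labels[v] = 1 if r[v] < r[root] else 2
        return labels
    raise ValueError(f"unknown strategy: {strategy}")


elbow = 2
nbrs = one_hop_neighbors(elbow, BONES)
spatial = partition_labels(elbow, nbrs, COORDS, "spatial")
```

In this toy frame, the elbow's neighbors are the shoulder and the hand. The shoulder lies closer to the center than the elbow and is labeled centripetal, while the hand lies farther away and is labeled centrifugal. Each label would select a separate set of learnable kernel weights.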