Skeleton-Based Action Recognition with Spatial-Structural Graph Convolution

Jingyao Wang, Emmanuel Bergeret, Issam Falih

TL;DR

This paper tackles skeleton-based action recognition by addressing two key limitations of conventional GCNs: edge nodes aggregate too little neighbor information, and deep GCNs over-smooth. It introduces SpSt-GCN, a two-branch network that combines a fixed spatial graph with a data-driven structural graph. The structural graph is derived from the similarity of edge-node sequences, computed via Dynamic Time Warping (DTW) and formulated as $A_s = -D^{-1} + I$, where $D^{-1}$ is the similarity matrix. The approach yields state-of-the-art results on NTU RGB+D and NTU RGB+D 120 while remaining efficient, thanks to a modular GCN-TCN backbone and multi-modal fusion of joint, velocity, and bone features. The work advances practical HAR by enabling sample-specific structural relationships and reducing over-smoothing, with potential applicability to broader graph-based recognition tasks.
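To make the structural-graph formula concrete, here is a minimal Python sketch of how $A_s = -D^{-1} + I$ could be built for one sample. It is an illustration under stated assumptions, not the authors' implementation: $D$ is taken to be the matrix of pairwise DTW distances between edge-node sequences, $D^{-1}$ is read as an elementwise reciprocal on the off-diagonal entries (the summary does not say whether it is elementwise or a matrix inverse), and the names `dtw_distance`, `structural_adjacency`, and `eps` are hypothetical.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic dynamic-programming DTW between sequences a (T1, C) and b (T2, C)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean frame distance
            cost[i, j] = step + min(cost[i - 1, j],      # insertion
                                    cost[i, j - 1],      # deletion
                                    cost[i - 1, j - 1])  # match
    return cost[n, m]


def structural_adjacency(seqs, eps=1e-6):
    """Build A_s = -D^{-1} + I for K edge-node sequences of shape (K, T, C).

    Assumption: D^{-1} is the elementwise reciprocal of the DTW distances,
    applied off the diagonal only; eps guards against division by zero.
    """
    k = len(seqs)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(seqs[i], seqs[j])
    sim = np.zeros_like(dist)
    off_diag = ~np.eye(k, dtype=bool)
    sim[off_diag] = 1.0 / (dist[off_diag] + eps)  # small distance -> high similarity
    return -sim + np.eye(k)


# Usage: 5 hypothetical edge nodes (head, hands, feet), 20 frames, 3D coordinates.
rng = np.random.default_rng(0)
edge_node_seqs = rng.normal(size=(5, 20, 3))
a_s = structural_adjacency(edge_node_seqs)
```

Because the matrix is recomputed from each sample's own sequences, the resulting connections are sample-specific, which matches the "data-driven" description above. The negative off-diagonal entries mean that similar edge nodes are subtracted from rather than averaged with each other.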

Abstract

Human Activity Recognition (HAR) is the task of identifying and classifying human activities. Skeleton-based HAR has received much attention in recent years; Graph Convolutional Network (GCN) based methods are widely used for it and have achieved remarkable results. However, the representation of skeleton data and the over-smoothing issue in GCNs still need to be studied. (1) Compared to central nodes, edge nodes can only aggregate limited neighbor information, even though different edge nodes of the human body are always structurally related and their information is crucial for fine-grained activity recognition. (2) GCNs suffer from significant over-smoothing, which causes nodes to become increasingly similar as the number of network layers increases. Based on these two observations, we propose a two-stream graph convolution method called Spatial-Structural GCN (SpSt-GCN). The spatial GCN performs information aggregation based on the topological structure of the human body, and the structural GCN performs differentiation based on the similarity of edge-node sequences. The spatial connection is fixed: the human skeleton naturally maintains this topology regardless of the action performed. The structural connection, in contrast, is dynamic and depends on the type of movement the human body is performing. Based on this idea, we also propose an entirely data-driven structural connection, which greatly increases flexibility. We evaluate our method on two large-scale datasets, NTU RGB+D and NTU RGB+D 120. The proposed method achieves good results while being efficient.
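The over-smoothing effect described in the abstract can be illustrated with a small Python sketch. It is a simplified demonstration, not the paper's analysis: it assumes pure mean aggregation $P = \hat{D}^{-1}(A + I)$, where $\hat{D}$ is the degree matrix of $A + I$ (unrelated to the DTW matrix $D$), with no learned weights or nonlinearities, applied to a hypothetical 5-joint chain. The helper names are illustrative.

```python
import numpy as np


def mean_aggregation_matrix(adj):
    """Row-normalized adjacency with self-loops: each node averages itself and its neighbors."""
    a_hat = adj + np.eye(len(adj))
    return a_hat / a_hat.sum(axis=1, keepdims=True)


def node_spread(x):
    """Average per-channel standard deviation across nodes (0 means all nodes are identical)."""
    return float(x.std(axis=0).mean())


# Hypothetical 5-joint chain; nodes 0 and 4 are edge nodes with a single neighbor.
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))  # 5 nodes, 3 channels
p = mean_aggregation_matrix(adj)

# Apply the same aggregation repeatedly, as stacked GCN layers would,
# and record how different the node features remain after each layer.
spreads = [node_spread(features)]
for _ in range(100):
    features = p @ features
    spreads.append(node_spread(features))

print(f"spread before any layer: {spreads[0]:.4f}")
print(f"spread after 100 layers: {spreads[-1]:.2e}")
```

In this toy setting the spread shrinks toward zero as layers are stacked, since repeated averaging on a connected graph pulls every node toward the same value. A branch that differentiates nodes, as the structural GCN is described as doing, works against that collapse.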

Paper Structure

This paper contains 19 sections, 12 equations, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Graph representation of skeleton data. (a) is the graph representation for spatial graph convolution; (b) is the proposed graph representation for structural graph convolution.
  • Figure 2: Spatial-Structural connection visualisation for joint input of NTU RGB+D dataset. The black line represents spatial connection and the red line represents structural connection. The thickness of the red line represents the strength of the structural connection.
  • Figure 3: Two example visualizations of the similarity matrix $D^{-1}$.
  • Figure 4: Visualization of learned adjacency matrix. The left matrix is a part of the learned spatial matrix. The right matrix is an example of learned structural matrix.
  • Figure 5: Model structure. N is the number of action classes, the numbers in each block are the numbers of input and output channels, and /2 denotes a stride of 2.
  • ...and 1 more figure