DG-STGCN: Dynamic Spatial-Temporal Modeling for Skeleton-based Action Recognition

Haodong Duan; Jiaqi Wang; Kai Chen; Dahua Lin

TL;DR

DG-STGCN introduces dynamic group spatial-temporal modeling for skeleton-based action recognition by learning multi-group spatial graphs (DG-GCN) and multi-branch temporal convolutions with dynamic joint-skeleton fusion (DG-TCN). It eliminates reliance on hand-crafted skeleton topology, enabling data-driven, adaptable inter-joint correlations and multi-level temporal patterns. A strong temporal augmentation, Uniform Sampling, further regularizes training and improves generalization. Empirical results across NTURGB+D, Kinetics-Skeleton, BABEL, and Toyota SmartHome show state-of-the-art performance with efficient computation, validating the effectiveness of the dynamic group approach.
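The summary names Uniform Sampling as a temporal augmentation but does not spell out the procedure. The sketch below assumes the common segment-based formulation: split the sequence into `clip_len` near-equal segments and draw one random frame index from each. The function name `uniform_sample`, its parameters, and the handling of short sequences are illustrative assumptions, not the authors' implementation.

```python
import random


def uniform_sample(num_frames, clip_len, rng=None):
    """Draw one frame index from each of `clip_len` near-equal segments.

    Assumed formulation: segment i covers frame indices
    [i * num_frames // clip_len, (i + 1) * num_frames // clip_len).
    """
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    rng = rng or random.Random()

    # Segment boundaries over the full sequence.
    bounds = [i * num_frames // clip_len for i in range(clip_len + 1)]

    indices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end > start:
            # Non-empty segment: pick a random frame inside it.
            indices.append(rng.randrange(start, end))
        else:
            # Sequence shorter than clip_len: the segment is empty,
            # so reuse the frame at its (clamped) start position.
            indices.append(min(start, num_frames - 1))
    return indices


# Each call yields a different but temporally ordered clip of the same sequence.
clip = uniform_sample(num_frames=100, clip_len=10, rng=random.Random(0))
```

Because every segment contributes exactly one frame, each sampled clip spans the whole action while the exact frames vary between epochs, which is what makes the scheme usable as an augmentation.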

Abstract

Graph convolution networks (GCNs) have been widely used in skeleton-based action recognition. We note that existing GCN-based approaches primarily rely on prescribed graphical structures (i.e., a manually defined topology of skeleton joints), which limits their flexibility to capture complicated correlations between joints. To move beyond this limitation, we propose a new framework for skeleton-based action recognition, namely Dynamic Group Spatio-Temporal GCN (DG-STGCN). It consists of two modules, DG-GCN and DG-TCN, for spatial and temporal modeling, respectively. In particular, DG-GCN uses learned affinity matrices to capture dynamic graphical structures instead of relying on a prescribed one, while DG-TCN performs group-wise temporal convolutions with varying receptive fields and incorporates a dynamic joint-skeleton fusion module for adaptive multi-level temporal modeling. On a wide range of benchmarks, including NTURGB+D, Kinetics-Skeleton, BABEL, and Toyota SmartHome, DG-STGCN consistently outperforms state-of-the-art methods, often by a notable margin.
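To make the idea of group-wise learned affinity matrices concrete, here is a minimal NumPy sketch of spatial aggregation in which each channel group mixes joint features with its own joint-to-joint coefficient matrix. The function name `grouped_spatial_aggregation`, the `(C, T, V)` tensor layout, and the random matrices standing in for learned parameters are assumptions for illustration; the data-dependent (dynamic) refinement described for DG-GCN is omitted.

```python
import numpy as np


def grouped_spatial_aggregation(x, affinity):
    """Fuse features across joints with one coefficient matrix per channel group.

    x:        (C, T, V) features: channels, frames, joints.
    affinity: (K, V, V) one joint-to-joint matrix per group; C must be divisible by K.
    Returns an array of shape (C, T, V).
    """
    C, T, V = x.shape
    K = affinity.shape[0]
    if C % K != 0:
        raise ValueError("number of channels must be divisible by number of groups")
    if affinity.shape[1:] != (V, V):
        raise ValueError("each affinity matrix must be V x V")

    # Split channels into K groups, then mix joints within each group:
    # out[k, c, t, w] = sum_v x[k, c, t, v] * affinity[k, v, w]
    xg = x.reshape(K, C // K, T, V)
    out = np.einsum("kctv,kvw->kctw", xg, affinity)
    return out.reshape(C, T, V)


# Hypothetical usage: 8 channels, 16 frames, 25 joints, 4 groups.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16, 25))
A = rng.standard_normal((4, 25, 25))  # stand-in for trained parameters
fused = grouped_spatial_aggregation(features, A)  # shape (8, 16, 25)
```

In a trained model the matrices would be optimized with the rest of the network rather than derived from the skeleton's bone connections, so no hand-crafted topology enters the computation.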

Paper Structure (21 sections, 4 equations, 6 figures, 8 tables)

Introduction
Related Works
Graph Neural Networks
Skeleton-based Action Recognition
DG-STGCN
ST-GCN Recap
DG-GCN: Dynamic Spatial Modeling from Scratch
DG-TCN: Multi-group TCN with Dynamic Joint-Skeleton Fusion
Uniform Sampling as Temporal Data Augmentation
Experiment
Datasets
Implementation Details
Ablation Study
Preliminary: Is a pre-defined topology indispensable?
Dynamic Group GCN (DG-GCN).
...and 6 more sections

Figures (6)

Figure 1: GFLOPs vs. accuracies on two NTURGB+D 120 benchmarks.
Figure 2: The typical framework of GCNs for skeleton-based action recognition. (a) A GCN consists of $N$ stacked GCN blocks, each consisting of a spatial module and a temporal module. (b) The spatial module performs feature fusion across joints with coefficient matrices $A$ (pre-defined or learned). (c) The temporal module learns temporal features with 1D temporal convolutions.
Figure 3: The architecture of the dynamic group GCN (DG-GCN).
Figure 4: The architecture of the dynamic group TCN (DG-TCN). 'D' indicates dilation.
Figure 5: The visualization of Uniform Sampling and two alternatives.
...and 1 more figures
