DG-STGCN: Dynamic Spatial-Temporal Modeling for Skeleton-based Action Recognition

Haodong Duan, Jiaqi Wang, Kai Chen, Dahua Lin

TL;DR

DG-STGCN introduces dynamic group spatial-temporal modeling for skeleton-based action recognition by learning multi-group spatial graphs (DG-GCN) and multi-branch temporal convolutions with dynamic joint-skeleton fusion (DG-TCN). It eliminates reliance on hand-crafted skeleton topology, enabling data-driven, adaptable inter-joint correlations and multi-level temporal patterns. A strong temporal augmentation, Uniform Sampling, further regularizes training and improves generalization. Empirical results across NTURGB+D, Kinetics-Skeleton, BABEL, and Toyota SmartHome show state-of-the-art performance with efficient computation, validating the effectiveness of the dynamic group approach.

Abstract

Graph convolution networks (GCN) have been widely used in skeleton-based action recognition. We note that existing GCN-based approaches primarily rely on prescribed graphical structures (i.e., a manually defined topology of skeleton joints), which limits their flexibility to capture complicated correlations between joints. To move beyond this limitation, we propose a new framework for skeleton-based action recognition, namely Dynamic Group Spatio-Temporal GCN (DG-STGCN). It consists of two modules, DG-GCN and DG-TCN, for spatial and temporal modeling, respectively. In particular, DG-GCN uses learned affinity matrices to capture dynamic graphical structures instead of relying on a prescribed one, while DG-TCN performs group-wise temporal convolutions with varying receptive fields and incorporates a dynamic joint-skeleton fusion module for adaptive multi-level temporal modeling. On a wide range of benchmarks, including NTURGB+D, Kinetics-Skeleton, BABEL, and Toyota SmartHome, DG-STGCN consistently outperforms state-of-the-art methods, often by a notable margin.
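
To make the idea of learned affinity matrices concrete, the following is a minimal NumPy sketch of group-wise spatial fusion, not the authors' implementation. It assumes features of shape `(C, T, V)` (channels, frames, joints) and one freely learnable `V x V` matrix per channel group; the function name `group_spatial_fusion`, the group count, and the random initialization are illustrative assumptions. Only the static learnable part is shown; any data-dependent refinement of the matrices is omitted.

```python
import numpy as np

def group_spatial_fusion(x, A):
    """Fuse joint features with one coefficient matrix per channel group.

    x: (C, T, V) features; A: (K, V, V) matrices, K must divide C.
    Hypothetical helper for illustration only.
    """
    C, T, V = x.shape
    K = A.shape[0]
    assert C % K == 0, "channels must split evenly into groups"
    xg = x.reshape(K, C // K, T, V)
    # For each group k: out[k, c, t, v] = sum_u A[k, v, u] * xg[k, c, t, u]
    out = np.einsum("kvu,kctu->kctv", A, xg)
    return out.reshape(C, T, V)

rng = np.random.default_rng(0)
V = 25                                       # joints per skeleton (assumed)
x = rng.standard_normal((8, 4, V))           # 8 channels, 4 frames
A = rng.standard_normal((4, V, V)) * 0.01    # 4 groups, no skeleton prior
y = group_spatial_fusion(x, A)
print(y.shape)
```

Because `A` is initialized without any adjacency information, every pair of joints can exchange features from the start, which is the sense in which no prescribed topology is needed.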

Paper Structure (21 sections, 4 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: GFLOPs vs. accuracies on two NTURGB+D 120 benchmarks.
  • Figure 2: The typical framework of GCNs for skeleton-based action recognition. (a) A GCN consists of $N$ stacked GCN blocks, each consisting of a spatial module and a temporal module. (b) The spatial module performs feature fusion across joints with coefficient matrices $A$ (pre-defined / learned). (c) The temporal module learns temporal features with 1D temporal convolutions.
  • Figure 3: The architecture of the dynamic group GCN (DG-GCN).
  • Figure 4: The architecture of the dynamic group TCN (DG-TCN). 'D' indicates dilation.
  • Figure 5: The visualization of Uniform Sampling and two alternatives.
  • ...and 1 more figure
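
The temporal module in Figure 2(c) and the dilated branches in Figure 4 can be illustrated with a short NumPy sketch. It is a simplified stand-in rather than the DG-TCN architecture: each branch takes one channel group and applies a single shared 1D kernel along time with its own dilation, and there is no channel mixing or joint-skeleton fusion. The function names, kernel values, and dilation choices are assumptions made for illustration.

```python
import numpy as np

def dilated_temporal_conv(x, w, dilation):
    """1D convolution along time with zero 'same' padding.

    x: (T, V) features of one channel; w: (k,) kernel with odd k.
    """
    k = w.shape[0]
    assert k % 2 == 1, "odd kernel size keeps the output aligned"
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T = x.shape[0]
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        # Tap i reads the frame (i - k // 2) * dilation steps away.
        out += w[i] * xp[i * dilation : i * dilation + T]
    return out

def multi_branch_tcn(x, kernels, dilations):
    """x: (C, T, V). Split channels evenly, one dilation per branch."""
    groups = np.split(x, len(dilations), axis=0)
    outs = []
    for g, w, d in zip(groups, kernels, dilations):
        outs.append(np.stack([dilated_temporal_conv(c, w, d) for c in g]))
    return np.concatenate(outs, axis=0)

# An impulse at frame 5 shows the receptive field of each branch.
x = np.zeros((2, 11, 1))
x[:, 5, 0] = 1.0
kernels = [np.ones(3), np.ones(3)]
y = multi_branch_tcn(x, kernels, dilations=[1, 2])
print(np.nonzero(y[0, :, 0])[0])  # branch with dilation 1
print(np.nonzero(y[1, :, 0])[0])  # branch with dilation 2
```

With the same kernel size, the branch with the larger dilation responds to frames further from the impulse, so running branches side by side covers several temporal ranges at once.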