SiT-MLP: A Simple MLP with Point-wise Topology Feature Learning for Skeleton-based Action Recognition

Shaojie Zhang, Jianqin Yin, Yonghao Dang, Jiajun Fu

TL;DR

The paper addresses skeleton-based action recognition by moving away from predefined human priors used in GCNs. It introduces STGU, a gate-based, MLP-backed module that learns point-wise, sample-specific spatial topology without priors, culminating in the SiT-MLP model. Empirical results on NTU RGB+D 60/120 and Northwestern-UCLA show competitive accuracy with far fewer parameters and improved efficiency, highlighting the viability of prior-free, MLP-based approaches for skeleton sequences. This work suggests that simple, adaptable MLP architectures can model global joint relationships effectively, offering generalization benefits and real-time deployment potential.

Abstract

Graph convolution networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. However, previous GCN-based methods rely excessively on elaborate human priors and construct complex feature aggregation mechanisms, which limits the generalizability and effectiveness of the networks. To solve these problems, we propose a novel Spatial Topology Gating Unit (STGU), an MLP-based variant without extra priors, to capture the co-occurrence topology features that encode the spatial dependency across all joints. In STGU, to learn the point-wise topology features, a new gate-based feature interaction mechanism is introduced that activates the features point-to-point using an attention map generated from the input sample. Based on the STGU, we propose the first MLP-based model, SiT-MLP, for skeleton-based action recognition. Compared with previous methods on three large-scale datasets, SiT-MLP achieves competitive performance while significantly reducing the number of parameters. The code will be available at https://github.com/BUPTSJZhang/SiT-MLP.
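The gate-based interaction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a gMLP-style split of the channels into two halves, a softmax dot-product attention map as the sample-specific term, a learned joint-by-joint matrix as the sample-generic term, and a simple sum to fuse the two. All names (gating_unit, w_query, w_key, w_generic) are hypothetical, and pure-Python lists stand in for tensors.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def random_matrix(rows, cols, rng, scale=0.1):
    """Uniform random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

def gating_unit(x, w_query, w_key, w_generic):
    """x is V joints by C channels; returns V by C // 2 gated features."""
    half = len(x[0]) // 2
    u = [row[:half] for row in x]  # branch that gets gated (assumed split)
    v = [row[half:] for row in x]  # branch that is mixed across joints

    # Sample-specific V x V attention map computed from the input itself.
    q = matmul(v, w_query)
    k = matmul(v, w_key)
    attention = softmax_rows(matmul(q, transpose(k)))

    # Sample-generic V x V weights are shared by every sample.
    specific = matmul(attention, v)
    generic = matmul(w_generic, v)

    # Point-wise gate: element-wise product of the two branches.
    return [[u[i][j] * (specific[i][j] + generic[i][j]) for j in range(half)]
            for i in range(len(x))]

rng = random.Random(0)
num_joints, channels = 5, 8
x = random_matrix(num_joints, channels, rng, scale=1.0)
w_query = random_matrix(channels // 2, 4, rng)
w_key = random_matrix(channels // 2, 4, rng)
w_generic = random_matrix(num_joints, num_joints, rng)
out = gating_unit(x, w_query, w_key, w_generic)
print(len(out), len(out[0]))
```

No adjacency matrix or bone list appears anywhere in the sketch: every joint can attend to every other joint, which is the sense in which such a unit needs no skeleton prior.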

Paper Structure (18 sections, 10 equations, 11 figures, 6 tables, 1 algorithm)

This paper contains 18 sections, 10 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of performance and parameter size on the X-sub benchmark of the NTU RGB+D 60 dataset. Accuracy is plotted on the vertical axis. The closer to the top-left, the better. Our method (SiT-MLP, in red) achieves the highest performance with the fewest parameters.
  • Figure 2: Comparison between the initialized normalized adjacency matrices and the final optimized adjacency matrices in the previous method [chen2021channel]. The letters a, b, and c denote the self-link, inward-connection, and outward-connection matrices, respectively. The numbers 1 and 2 indicate the initialized adjacency matrix and the final adjacency matrix, respectively.
  • Figure 3: The spatial modeling structure of different approaches: (a) the standard sample-generic modeling module; (b) the channel-wise topology refinement modeling module; (c) the proposed Spatial Topology Gating Unit.
  • Figure 4: Model architecture overview and illustration. The embedding block is adopted to retain the positional information. The STGU module captures the spatial dependency, and the MS-TC module aggregates the temporal information. The global average pooling layer is used to aggregate the global spatial-temporal joint information for the final linear classifier.
  • Figure 5: Framework of the proposed Spatial Topology Gating Unit. Feature transformation maps the input features into a latent high-dimensional feature space. Point-wise attention modeling builds an entirely independent topology attention. Sample-specific aggregation selects dynamic features for the current sample. Sample-generic aggregation captures the features common to all samples. Feature updating fuses and updates the features after aggregation.
  • ...and 6 more figures