GraphMLP: A Graph MLP-Like Architecture for 3D Human Pose Estimation

Wenhao Li, Mengyuan Liu, Hong Liu, Tianyu Guo, Ti Wang, Hao Tang, Nicu Sebe

TL;DR

This work proposes a simple yet effective graph-reinforced MLP-Like architecture, named GraphMLP, that combines MLPs and graph convolutional networks (GCNs) in a global-local-graphical unified architecture for 3D human pose estimation.

Abstract

Modern multi-layer perceptron (MLP) models have shown competitive results in learning visual representations without self-attention. However, existing MLP models are poor at capturing local details and lack prior knowledge of human body configurations, which limits their modeling power for skeletal representation learning. To address these issues, we propose a simple yet effective graph-reinforced MLP-Like architecture, named GraphMLP, that combines MLPs and graph convolutional networks (GCNs) in a global-local-graphical unified architecture for 3D human pose estimation. GraphMLP incorporates the graph structure of human bodies into an MLP model to meet the domain-specific demands of 3D human pose estimation, while allowing for both local and global spatial interactions. Furthermore, we propose to flexibly and efficiently extend GraphMLP to the video domain and show that complex temporal dynamics can be effectively modeled in a simple way, with a negligible increase in computational cost as the sequence length grows. To the best of our knowledge, this is the first MLP-Like architecture for 3D human pose estimation from both a single frame and a video sequence. Extensive experiments show that the proposed GraphMLP achieves state-of-the-art performance on two datasets, i.e., Human3.6M and MPI-INF-3DHP. Code and models are available at https://github.com/Vegetebird/GraphMLP.
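
To make the "global plus local" idea concrete, here is a minimal NumPy sketch of one spatial block in which a token-mixing MLP (global interactions across all joints) and a single graph convolution (local interactions along bones) run in parallel and are added to a residual path. This is an illustration under stated assumptions, not the authors' implementation: the function names, the ReLU activation, the symmetric adjacency normalization, and the 5-joint chain skeleton are all hypothetical choices made for the example.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Adjacency with self-loops, normalized as D^-1/2 (A + I) D^-1/2 (an assumed choice)."""
    a = np.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def layer_norm(x, eps=1e-6):
    """Normalize each token over its channel axis (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def spatial_graph_mlp(x, adj, w1, w2, w_gcn):
    """x: (J, C) joint tokens. Returns (J, C).

    Global branch: an MLP applied along the joint axis, so every joint sees every other.
    Local branch: one graph convolution, so each joint only sees its skeleton neighbors.
    """
    h = layer_norm(x)
    global_out = (np.maximum(h.T @ w1, 0.0) @ w2).T  # (C, J) -> (C, H) -> (C, J) -> (J, C)
    local_out = adj @ h @ w_gcn                      # (J, J) @ (J, C) @ (C, C)
    return x + global_out + local_out                # parallel branches plus residual

# Hypothetical toy setup: 5 joints in a chain, 8 channels, hidden width 16.
num_joints, channels, hidden = 5, 8, 16
rng = np.random.default_rng(0)
adj = normalized_adjacency([(0, 1), (1, 2), (2, 3), (3, 4)], num_joints)
x = rng.normal(size=(num_joints, channels))
out = spatial_graph_mlp(
    x, adj,
    rng.normal(size=(num_joints, hidden)) * 0.1,
    rng.normal(size=(hidden, num_joints)) * 0.1,
    rng.normal(size=(channels, channels)) * 0.1,
)
print(out.shape)
```

Summing the two branches keeps the token shape unchanged, so blocks of this kind can be stacked; how the branches are actually fused in GraphMLP should be checked against the released code.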

Paper Structure

This paper contains 23 sections, 11 equations, 13 figures, and 11 tables.

Figures (13)

  • Figure 1: Performance comparison with MLP-Mixer [mlpmixer] and GCN [stgcn] on the Human3.6M (a) and MPI-INF-3DHP (b) datasets. The proposed GraphMLP absorbs the advantages of modern MLPs and GCNs to effectively learn skeletal representations, consistently outperforming each of them. The evaluation metric is MPJPE (lower is better).
  • Figure 2: Overview of the proposed GraphMLP architecture. The left side illustrates the skeletal structure of the human body. The 2D joint inputs detected by a 2D pose estimator are sparse, graph-structured data. GraphMLP treats each 2D keypoint as an input token, linearly embeds each token through the skeleton embedding, feeds the embedded tokens to the GraphMLP layers, and finally regresses the 3D pose from the resulting features via the prediction head. Each GraphMLP layer contains one spatial graph MLP (SG-MLP) and one channel graph MLP (CG-MLP). For ease of illustration, we show the architecture using a single image as input.
  • Figure 3: (a) The human skeleton graph in physical and symmetrical connections. (b) The adjacency matrix used in the GCN blocks of GraphMLP. Different colors denote the different types of bone connections.
  • Figure 4: Comparison of MLP layers. (a) MLP-Mixer layer [mlpmixer]. (b) Our GraphMLP layer. Compared with MLP-Mixer, GraphMLP incorporates graph structural priors into the MLP model via GCN blocks. The MLPs and GCNs are arranged in parallel to model both local and global interactions.
  • Figure 5: Illustration of the process of GraphMLP in the video domain.
  • ...and 8 more figures
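
The data flow described in the Figure 2 caption and the MPJPE metric named in Figure 1 can be sketched as follows. This is a minimal shape-level illustration, not the authors' code: the function names, the 17-joint skeleton, the channel width of 32, and the random weights are assumptions made for the example, and the GraphMLP layers themselves are omitted. MPJPE is computed as the mean Euclidean distance between predicted and ground-truth 3D joints.

```python
import numpy as np

def skeleton_embedding(joints_2d, w_embed, b_embed):
    """Linearly embed each 2D keypoint as one token: (J, 2) -> (J, C)."""
    return joints_2d @ w_embed + b_embed

def prediction_head(tokens, w_head, b_head):
    """Regress one 3D coordinate per token: (J, C) -> (J, 3)."""
    return tokens @ w_head + b_head

def mpjpe(pred_3d, target_3d):
    """Mean per-joint position error: Euclidean distance per joint, averaged."""
    pred_3d = np.asarray(pred_3d, dtype=float)
    target_3d = np.asarray(target_3d, dtype=float)
    if pred_3d.shape != target_3d.shape:
        raise ValueError("pred_3d and target_3d must have the same shape")
    return float(np.linalg.norm(pred_3d - target_3d, axis=-1).mean())

# Assumed sizes for illustration only.
num_joints, channels = 17, 32
rng = np.random.default_rng(0)

joints_2d = rng.normal(size=(num_joints, 2))  # stand-in for 2D detector output
tokens = skeleton_embedding(joints_2d, rng.normal(size=(2, channels)), np.zeros(channels))
# The stacked GraphMLP layers would transform `tokens` here; they are left out of this sketch.
pose_3d = prediction_head(tokens, rng.normal(size=(channels, 3)), np.zeros(3))
print(pose_3d.shape)

# Tiny hand-checkable MPJPE case: one joint is off by a 3-4-5 triangle, one is exact.
error = mpjpe([[0, 0, 0], [0, 0, 0]], [[3, 4, 0], [0, 0, 0]])
print(error)
```

Because the metric averages per-joint distances, it is reported in the same units as the 3D coordinates.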