Multi-hop graph transformer network for 3D human pose estimation

Zaedul Islam, A. Ben Hamza

TL;DR

This paper tackles 2D-to-3D pose estimation in videos under occlusion and depth ambiguity by introducing MGT-Net, a spatio-temporal model that fuses graph attention with multi-hop graph convolutions and dilated convolutions. The architecture comprises a skeleton embedding, a graph attention block with a learnable adjacency matrix, and a multi-hop GCN block that disentangles neighborhoods to capture long-range dependencies efficiently. In experiments on Human3.6M and MPI-INF-3DHP, MGT-Net achieves competitive MPJPE and PA-MPJPE while maintaining a small parameter footprint, and demonstrates strong generalization across datasets. The work shows meaningful improvements over several baselines and highlights the value of integrating graph-structured processing with transformer-style attention for robust 3D pose estimation.

Abstract

Accurate 3D human pose estimation is a challenging task due to occlusion and depth ambiguity. In this paper, we introduce a multi-hop graph transformer network designed for 2D-to-3D human pose estimation in videos by leveraging the strengths of multi-head self-attention and multi-hop graph convolutional networks with disentangled neighborhoods to capture spatio-temporal dependencies and handle long-range interactions. The proposed network architecture consists of a graph attention block composed of stacked layers of multi-head self-attention and graph convolution with a learnable adjacency matrix, and a multi-hop graph convolutional block comprised of multi-hop convolutional and dilated convolutional layers. The combination of multi-head self-attention and multi-hop graph convolutional layers enables the model to capture both local and global dependencies, while the integration of dilated convolutional layers enhances the model's ability to handle spatial details required for accurate localization of the human body joints. Extensive experiments demonstrate the effectiveness and generalization ability of our model, achieving competitive performance on benchmark datasets.

Paper Structure (18 sections, 10 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Performance and model size comparison between our model and state-of-the-art temporal methods for 3D human pose estimation, including PoseFormer [PoseFormer:2021], VideoPose3D [pavllo20193d], ST-GCN [YujunCai:19], SRNet [zeng2020srnet], Attention3D [liu2020attention], Anatomy3D [chen2021anatomy], and HTNet [cai2023htnet]. Lower Mean Per Joint Position Error (MPJPE) values indicate better performance. Evaluation is conducted on the Human3.6M dataset using detected 2D joints as input.
  • Figure 2: Visual comparison between the standard graph convolution, which only considers the 1-hop neighbors, and the multi-hop graph convolution, which takes into account neighbors at different distances. The node label $k\in\{0,\dots,5\}$ indicates that the corresponding body joint is a $k$-hop neighbor of the pelvis (i.e., root node denoted by 0).
  • Figure 3: Comparing the sparsity of the $k$-th power of the adjacency matrix (top row) and the $k$-adjacency matrix (bottom row). As the value of $k$ increases, the $k$-th power representation tends to become denser, while the $k$-adjacency matrix maintains higher sparsity. The sparsity of the $k$-adjacency matrix makes it an efficient choice for capturing long-range dependencies in the multi-hop GCN with disentangled neighborhoods, reducing computational complexity and memory usage.
  • Figure 4: Network architecture of the proposed MGT-Net for 3D human pose estimation. Our model takes a sequence of 2D pose coordinates as input and generates 3D pose predictions as output. The core building blocks of the network are a graph attention block and a multi-hop graph convolutional block, which are stacked together. We use a total of five layers for these stacks. In the graph attention block, the multi-head attention layer is followed by two consecutive graph convolutional layers with a learnable adjacency matrix (LAM-GConv). The multi-hop graph convolutional block is composed of two subblocks, each of which comprises a multi-hop GConv layer followed by a dilated convolutional layer.
  • Figure 5: Visual comparison between MGT-Net, MGCN and ground truth on the Human3.6M test set. Compared to MGCN, our model is able to produce better predictions.
  • ...and 1 more figures
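
The hop labels in Figure 2 and the sparsity comparison in Figure 3 can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the captions: the skeleton is the commonly used 17-joint Human3.6M layout with the pelvis as joint 0; the $k$-adjacency matrix has a 1 exactly where the shortest-path distance between two joints equals $k$ (self-loops are omitted here); and the $k$-th power is taken of the plain adjacency matrix without self-loops. All function names are hypothetical and are not taken from the authors' code.

```python
from collections import deque

# Assumed 17-joint Human3.6M-style skeleton; joint 0 is the pelvis (root).
EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6), (0, 7), (7, 8),
         (8, 9), (9, 10), (8, 11), (11, 12), (12, 13), (8, 14), (14, 15), (15, 16)]
NUM_JOINTS = 17


def adjacency(edges, n):
    """Symmetric 0/1 adjacency matrix of an undirected graph (no self-loops)."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a


def hop_distances(adj, source):
    """Breadth-first search: hop distance from `source` to every node."""
    dist = [-1] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, connected in enumerate(adj[u]):
            if connected and dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def k_adjacency(adj, k):
    """Assumed definition: entry (i, j) is 1 iff joints i and j are exactly k hops apart."""
    n = len(adj)
    dist = [hop_distances(adj, i) for i in range(n)]
    return [[1 if dist[i][j] == k else 0 for j in range(n)] for i in range(n)]


def matrix_power(adj, k):
    """k-th power of the adjacency matrix; entry (i, j) counts walks of length k."""
    n = len(adj)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = [[sum(result[i][t] * adj[t][j] for t in range(n)) for j in range(n)]
                  for i in range(n)]
    return result


def nonzeros(matrix):
    """Number of nonzero entries, used here as a simple density measure."""
    return sum(1 for row in matrix for value in row if value != 0)


A = adjacency(EDGES, NUM_JOINTS)
PELVIS_HOPS = hop_distances(A, 0)  # hop label of each joint relative to the pelvis

print("hops from pelvis:", PELVIS_HOPS)
for k in range(1, 6):
    print(f"k={k}: nonzeros in A^k = {nonzeros(matrix_power(A, k))}, "
          f"nonzeros in k-adjacency = {nonzeros(k_adjacency(A, k))}")
```

Under these assumptions, the farthest joints from the pelvis are five hops away, which agrees with the label range $k\in\{0,\dots,5\}$ in Figure 2. For $k \geq 2$ the power matrix also has nonzero entries for closer joint pairs that can be reached by back-and-forth walks of length $k$, whereas the $k$-adjacency matrix keeps only the pairs at distance exactly $k$.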