Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

Mengmeng Cui; Kunbo Zhang; Zhenan Sun

TL;DR

This paper takes a global approach to exploiting spatio-temporal information for efficient 3D HPE, using a concise Graph and Skipped Transformer architecture that achieves superior performance compared with previous state-of-the-art methods.

Abstract

In recent years, 2D-to-3D pose uplifting in monocular 3D Human Pose Estimation (HPE) has attracted widespread research interest. GNN-based and Transformer-based methods have become the mainstream architectures owing to their advanced spatial and temporal feature learning capacities. However, existing approaches typically construct joint-wise and frame-wise attention alignments in the spatial and temporal domains, resulting in dense connections that introduce considerable local redundancy and computational overhead. In this paper, we take a global approach to exploiting spatio-temporal information and realise efficient 3D HPE with a concise Graph and Skipped Transformer architecture. Specifically, in the Spatial Encoding stage, coarse-grained body parts are used to construct a Spatial Graph Network with a fully data-driven adaptive topology, ensuring model flexibility and generalizability across various poses. In the Temporal Encoding and Decoding stages, a simple yet effective Skipped Transformer is proposed to capture long-range temporal dependencies and implement hierarchical feature aggregation. A straightforward Data Rolling strategy is also developed to introduce dynamic information into the 2D pose sequence. Extensive experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks. The G-SFormer series of methods achieves superior performance compared with previous state-of-the-art methods with only around ten percent of the parameters and significantly reduced computational complexity. G-SFormer also exhibits outstanding robustness to inaccuracies in detected 2D poses.
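
The skip-sampling idea behind the Skipped Transformer can be illustrated with a small NumPy sketch. It assumes that "skip-sampled pose token sets" means interleaved subsets taken at a fixed stride, and that "reordering" means writing each subset back to its original frame positions; the function names `skip_sample` and `restore_order`, the stride value, and the tensor shapes are hypothetical and not taken from the paper. No attention is computed here, only the index bookkeeping.

```python
from typing import List

import numpy as np


def skip_sample(tokens: np.ndarray, stride: int) -> List[np.ndarray]:
    """Split a (T, C) token sequence into `stride` interleaved subsets.

    Subset k holds frames k, k + stride, k + 2 * stride, ... so each subset
    spans the whole clip while containing only about T / stride tokens.
    """
    return [tokens[k::stride] for k in range(stride)]


def restore_order(subsets: List[np.ndarray], stride: int) -> np.ndarray:
    """Write each subset back to its original frame positions."""
    total = sum(len(s) for s in subsets)
    out = np.empty((total,) + subsets[0].shape[1:], dtype=subsets[0].dtype)
    for k, subset in enumerate(subsets):
        out[k::stride] = subset
    return out


# Toy input: 9 frames, 4 channels per frame token (assumed sizes).
tokens = np.arange(9 * 4, dtype=np.float32).reshape(9, 4)
subsets = skip_sample(tokens, stride=3)
restored = restore_order(subsets, stride=3)
```

Under this reading, a temporal encoder applied to each subset attends over far fewer tokens than one applied to the full sequence, while every subset still covers the full temporal range of the clip.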

Paper Structure

This paper contains 17 sections, 12 equations, 10 figures, 3 tables.

Introduction
Related Works
2D-to-3D Pose Lifting
Transformer-based 3D HPE
Method
Spatial Graph Construction
Skipped Transformer for Temporal Modeling
Skipped Transformer for Temporal Encoding
Skipped Transformer for Temporal Decoding
Data Completion Strategies for 2D Pose Input
Loss Function
Experiments
Datasets
Implementation Details
Comparison with State of the Arts
...and 2 more sections

Figures (10)

Figure 1: Left: Comparison of spatio-temporal correlation modeling methods: (a) building joint-wise connections and frame-wise connections for each joint; (b) building joint-wise connections and frame-wise connections for the pose representation; (c) our G-SFormer, which constructs part-based spatial alignments and long-range temporal skipped connections. Right: MPJPE (mm) vs. MFLOPs of the proposed G-SFormer and competitors on the Human3.6M dataset, where marker size indicates model size.
Figure 2: Graph and Skipped Transformer (G-SFormer) consists of three modules: a Spatial Graph Encoder for spatial modeling of human body part correlations, and a Skipped Transformer Encoder and Decoder for hierarchical temporal feature extraction and aggregation. Skip-sampled pose token sets are reordered to the original sequence after being encoded by the temporal Skipped Transformer, and are progressively aggregated by Skipped Multi-head Self-Attention (MSA) to obtain the target pose representation in the temporal decoding stage.
Figure 3: (a) Architecture of the Spatial Graph Encoder. (b) Updating process of graph nodes. Using node $f_{p5}$ as an example, it is concatenated with other aggregated part features to get $f_{p5}^{'}$.
Figure 4: Data completion strategies for 2D pose input. The example takes the target frame at t=3, where 2 previous frames must be completed to form a full 9-frame input sequence. Unlike conventional methods, which copy the edge frame at t=1 multiple times (b), the proposed Data Expanding and Data Rolling strategies replicate the 2D pose step by step (c) or capture a clip of the pose sequence for completion (d).
Figure 5: Quantitative comparisons with SOTA methods on MPI-INF-3DHP dataset. Best: bold, second best: underlined.
...and 5 more figures
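
The input-completion problem in the Figure 4 caption can be made concrete with a short NumPy sketch. It reproduces the conventional baseline (copying the edge frame) for a target at t=3 with a 9-frame window, and contrasts it with a mirrored completion built from neighbouring frames. The mirrored variant is a stand-in chosen for illustration only: it is not the paper's Data Expanding or Data Rolling procedure, whose details are not given in this text. The function name `complete_window`, the 17-joint layout, and the 12-frame clip are assumptions.

```python
import numpy as np


def complete_window(poses: np.ndarray, target: int, receptive: int = 9,
                    mode: str = "edge") -> np.ndarray:
    """Return the `receptive`-frame window centred on frame `target`.

    poses:  (T, J, 2) array of 2D joint coordinates.
    target: 0-indexed centre frame.
    mode:   "edge" copies the boundary frame (conventional padding);
            "reflect" mirrors neighbouring frames (illustrative stand-in).
    """
    half = receptive // 2
    # Pad only the time axis; joints and coordinates are left untouched.
    pad_width = [(half, half)] + [(0, 0)] * (poses.ndim - 1)
    padded = np.pad(poses, pad_width, mode=mode)
    # Original frame i sits at index i + half in the padded array.
    return padded[target:target + receptive]


# Toy clip: 12 frames, 17 joints, every coordinate equal to the frame index.
poses = np.arange(12, dtype=np.float64)[:, None, None] * np.ones((1, 17, 2))

# Target frame t=3 in the caption's 1-indexed notation is index 2 here.
edge_window = complete_window(poses, target=2, mode="edge")
mirror_window = complete_window(poses, target=2, mode="reflect")

edge_ids = edge_window[:, 0, 0].tolist()      # frame index of each slot
mirror_ids = mirror_window[:, 0, 0].tolist()
```

With edge padding the two missing slots are static copies of the first frame, so the start of the window carries no motion. Any completion drawn from real neighbouring frames keeps frame-to-frame change in those slots, which is the kind of dynamic information the abstract attributes to Data Rolling.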
