Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

Mengmeng Cui, Kunbo Zhang, Zhenan Sun

TL;DR

This paper takes a global approach to exploiting spatio-temporal information and realises efficient 3D HPE with a concise Graph and Skipped Transformer architecture, achieving superior performance compared with previous state-of-the-art methods.

Abstract

In recent years, 2D-to-3D pose uplifting in monocular 3D Human Pose Estimation (HPE) has attracted widespread research interest. GNN-based and Transformer-based methods have become mainstream architectures due to their advanced spatial and temporal feature learning capacities. However, existing approaches typically construct joint-wise and frame-wise attention alignments in the spatial and temporal domains, resulting in dense connections that introduce considerable local redundancy and computational overhead. In this paper, we take a global approach to exploiting spatio-temporal information and realise efficient 3D HPE with a concise Graph and Skipped Transformer architecture. Specifically, in the Spatial Encoding stage, coarse-grained body parts are used to construct a Spatial Graph Network with a fully data-driven adaptive topology, ensuring model flexibility and generalizability across various poses. In the Temporal Encoding and Decoding stages, a simple yet effective Skipped Transformer is proposed to capture long-range temporal dependencies and implement hierarchical feature aggregation. A straightforward Data Rolling strategy is also developed to introduce dynamic information into the 2D pose sequence. Extensive experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks. The G-SFormer series of methods achieves superior performance compared with previous state-of-the-art methods with only around ten percent of the parameters and significantly reduced computational complexity. G-SFormer also exhibits outstanding robustness to inaccuracies in detected 2D poses.
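To make the idea of skipped temporal sampling more concrete, the following is a minimal sketch of the index bookkeeping only: a frame sequence is split into interleaved subsets and later restored to its original order. The function names `skip_sample` and `reorder`, the stride of 3, and the 9-frame sequence are assumptions made for illustration; the actual sampling scheme and attention layers of G-SFormer are not specified in this section.

```python
def skip_sample(frames, stride):
    """Split a frame sequence into `stride` interleaved subsets.

    Subset k holds frames k, k + stride, k + 2 * stride, ...
    (hypothetical helper, not taken from the paper).
    """
    return [frames[offset::stride] for offset in range(stride)]


def reorder(subsets):
    """Interleave the subsets back into the original temporal order."""
    stride = len(subsets)
    total = sum(len(subset) for subset in subsets)
    restored = [None] * total
    for offset, subset in enumerate(subsets):
        # Extended slice assignment puts each subset back at its stride positions.
        restored[offset::stride] = subset
    return restored


frames = list(range(9))                  # stand-in for 9 pose tokens
subsets = skip_sample(frames, stride=3)  # each subset spans the whole sequence
restored = reorder(subsets)              # same order as `frames`
```

Each subset covers the full time span with only a third of the tokens, which is one way to see how skipped connections can reach long-range dependencies at a lower cost than dense frame-wise attention.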

Paper Structure

This paper contains 17 sections, 12 equations, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: Left: Comparison of spatio-temporal correlation modeling methods: (a) building joint-wise connections and frame-wise connections for each joint; (b) building joint-wise connections and frame-wise connections for the pose representation; (c) our G-SFormer: constructing part-based spatial alignments and long-range temporal skipped connections. Right: MPJPE (mm) vs. MFLOPs of the proposed G-SFormer and competitors on the Human3.6M dataset, where marker size indicates model size.
  • Figure 2: Graph and Skipped Transformer (G-SFormer) consists of three modules: a Spatial Graph Encoder for spatial modeling of human body part correlations, and a Skipped Transformer Encoder and Decoder for hierarchical extraction and aggregation of temporal features. Skip-sampled pose token sets are reordered to the original sequence after being encoded by the temporal Skipped Transformer, and are progressively aggregated by Skipped Multi-head Self-Attention (MSA) to obtain the target pose representation in the temporal decoding stage.
  • Figure 3: (a) Architecture of the Spatial Graph Encoder. (b) Updating process of graph nodes. Using node $f_{p5}$ as an example, it is concatenated with other aggregated part features to get $f_{p5}^{'}$.
  • Figure 4: Data completion strategies for 2D pose input. The example takes the target frame at t=3, where 2 previous frames need to be completed for a full 9-frame input sequence. Unlike conventional methods, which copy the edge frame at t=1 multiple times (b), the proposed Data Expanding and Data Rolling strategies replicate the 2D pose step by step (c), or capture a clip of the pose sequence for completion (d).
  • Figure 5: Quantitative comparisons with SOTA methods on MPI-INF-3DHP dataset. Best: bold, second best: underlined.
  • ...and 5 more figures
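
The data completion setting in the Figure 4 caption (target frame t=3, 9-frame window, 2 missing previous frames) can be sketched as follows. The first function reproduces the conventional edge-frame copying that the caption describes. The second is only a stand-in: it uses plain mirror (reflect) padding to show what filling the gap with a clip of real frames can look like, and it should not be read as the paper's Data Expanding or Data Rolling rule, which this section does not define. All function names and the sequence length of 100 are assumptions.

```python
def window_indices(target, length):
    """1-indexed frame ids of a window of odd `length` centred on `target`."""
    half = length // 2
    return list(range(target - half, target + half + 1))


def complete_edge(indices, first, last):
    """Conventional completion: clamp out-of-range ids to the edge frame."""
    return [min(max(i, first), last) for i in indices]


def complete_reflect(indices, first, last):
    """Stand-in clip-based completion: mirror out-of-range ids back inside.

    Assumes the sequence is long enough that one reflection suffices.
    This is ordinary reflect padding, not the paper's Data Rolling.
    """
    completed = []
    for i in indices:
        if i < first:
            i = 2 * first - i
        elif i > last:
            i = 2 * last - i
        completed.append(i)
    return completed


idx = window_indices(target=3, length=9)          # ids -1 and 0 do not exist
edge = complete_edge(idx, first=1, last=100)      # frame 1 copied into the gap
clip = complete_reflect(idx, first=1, last=100)   # frames 3 and 2 fill the gap
```

With edge copying the completed part of the window is static, whereas a clip of real frames carries motion, which matches the abstract's stated motivation of introducing dynamic information into the 2D pose sequence.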