Dual-stream Transformer-GCN Model with Contextualized Representations Learning for Monocular 3D Human Pose Estimation

Mingrui Ye, Lianping Yang, Hegui Zhu, Zenghao Zheng, Xin Wang, Yantao Lo

TL;DR

The paper addresses monocular 3D human pose estimation (HPE) under depth ambiguity and limited 3D labels by introducing a Transformer-GCN dual-stream architecture that jointly models global spatial-temporal relations and local joint relations. It develops contextualized representations learning (CRL), a pre-training scheme that combines masking with self-distillation against an exponential-moving-average (EMA) teacher and can be run on unlabeled 2D pose data. The method reports state-of-the-art results on Human3.6M (MPJPE of 38.0 mm under Protocol 1 and P-MPJPE of 31.9 mm under Protocol 2) and on MPI-INF-3DHP (MPJPE of 15.9 mm), along with gains in PCK and AUC, and it generalizes to in-the-wild sequences. By combining adaptive fusion of the Transformer and GCN streams with CRL pre-training, the approach offers a scalable path to accurate 3D pose estimation and extends to related motion tasks.
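
The masking and EMA-teacher ingredients mentioned above can be illustrated with a minimal sketch. It assumes a generic self-distillation recipe, not the paper's exact procedure: the names `mask_tokens` and `ema_update`, the random masking rule, the mask value, and the decay are all hypothetical choices made for illustration.

```python
import random


def mask_tokens(tokens, mask_ratio, mask_value=0.0, seed=0):
    """Replace a random subset of tokens with mask_value.

    Assumption: uniform random masking; the paper's own masking
    strategy is not specified in this summary.
    """
    rng = random.Random(seed)
    n_mask = int(round(mask_ratio * len(tokens)))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_value if i in masked_idx else t for i, t in enumerate(tokens)]
    return masked, sorted(masked_idx)


def ema_update(teacher, student, decay=0.999):
    """Return teacher weights moved toward the student weights.

    Standard EMA rule: teacher <- decay * teacher + (1 - decay) * student.
    The default decay is a hypothetical value, not one taken from the paper.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# The student sees the masked input; the teacher sees the full input and is
# never trained by gradient descent, only refreshed by the EMA rule.
masked, idx = mask_tokens([1.0, 2.0, 3.0, 4.0], mask_ratio=0.5)
teacher = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.5)
print(len(idx), teacher)  # 2 [0.5, 1.0]
```

In a real training loop the weights would be tensors and `ema_update` would run once per optimizer step; flat lists of floats are used here only to keep the sketch self-contained.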

Abstract

This paper introduces a novel approach to monocular 3D human pose estimation using contextualized representation learning with the Transformer-GCN dual-stream model. Monocular 3D human pose estimation is challenged by depth ambiguity, limited 3D-labeled training data, imbalanced modeling, and restricted model generalization. To address these limitations, our work introduces a groundbreaking motion pre-training method based on contextualized representation learning. Specifically, our method involves masking 2D pose features and utilizing a Transformer-GCN dual-stream model to learn high-dimensional representations through a self-distillation setup. By focusing on contextualized representation learning and spatial-temporal modeling, our approach enhances the model's ability to understand spatial-temporal relationships between postures, resulting in superior generalization. Furthermore, leveraging the Transformer-GCN dual-stream model, our approach effectively balances global and local interactions in video pose estimation. The model adaptively integrates information from both the Transformer and GCN streams, where the GCN stream effectively learns local relationships between adjacent key points and frames, while the Transformer stream captures comprehensive global spatial and temporal features. Our model achieves state-of-the-art performance on two benchmark datasets, with an MPJPE of 38.0mm and P-MPJPE of 31.9mm on Human3.6M, and an MPJPE of 15.9mm on MPI-INF-3DHP. Furthermore, visual experiments on public datasets and in-the-wild videos demonstrate the robustness and generalization capabilities of our approach.
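
For readers unfamiliar with the reported metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. P-MPJPE is the same quantity computed after rigidly aligning the prediction to the ground truth. The sketch below covers MPJPE only; the joint coordinates are made-up illustration values, not data from the paper.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the inputs (mm here).

    pred, gt: equal-length sequences of (x, y, z) joint coordinates.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Two joints with per-joint errors of 5 mm and 12 mm.
gt = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (0.0, 100.0, 12.0)]
print(mpjpe(pred, gt))  # 8.5
```

On a benchmark the average would additionally run over all frames and sequences of the test set.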

Paper Structure

This paper contains 19 sections, 14 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Framework overview. Our method employs a dual-stream architecture with a Transformer and a GCN, combined by an adaptive fusion method, to balance the modeling of local and global dependencies. The training strategy, divided into pre-training with contextualized representations learning and fine-tuning, allows effective transfer to robust 3D HPE and even to other human motion tasks.
  • Figure 2: Detailed architecture of the Transformer-GCN model. The network consists of two streams: the GCN stream learns local relations and the Transformer stream learns global interactions. An adaptive fusion method then fuses their features into a new representation that is balanced in spatial-temporal correlations and in local-global modeling.
  • Figure 3: Framework of our pre-training task based on contextualized representations learning. After the feature extractor layer extracts features from the 2D pose input, we apply our masking strategy and use the masked representations to predict latent representations of the full input in a self-distillation setup. Through this process, the model learns spatio-temporal correlations and obtains a good initialization.
  • Figure 4: Pose estimation visualization results for Human3.6M.
  • Figure 5: Pose estimation visualization for in-the-wild videos.
  • ...and 1 more figures
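
The adaptive fusion described in the Figure 2 caption can be sketched as a learned, input-dependent weighting of the two streams. The exact formulation is not given in this summary, so the following is an assumption: a linear gate over the concatenated features followed by a softmax, with `adaptive_fuse`, `w`, and `b` as hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(f_trans, f_gcn, w, b):
    """Fuse two same-length feature vectors with input-dependent weights.

    w: two rows of length 2 * d (one gate row per stream), b: two biases.
    These gate parameters are hypothetical and would normally be learned.
    """
    concat = list(f_trans) + list(f_gcn)
    scores = [sum(wi * x for wi, x in zip(row, concat)) + bi
              for row, bi in zip(w, b)]
    a_trans, a_gcn = softmax(scores)  # weights are positive and sum to 1
    fused = [a_trans * t + a_gcn * g for t, g in zip(f_trans, f_gcn)]
    return fused, (a_trans, a_gcn)


# With an all-zero gate both streams get weight 0.5, so the fusion
# reduces to a plain average of the two feature vectors.
zero_gate = [[0.0] * 4, [0.0] * 4]
fused, weights = adaptive_fuse([1.0, 3.0], [3.0, 1.0], zero_gate, [0.0, 0.0])
print(fused, weights)  # [2.0, 2.0] (0.5, 0.5)
```

Because the weights depend on the features themselves, a trained gate could lean on the GCN stream where local joint structure dominates and on the Transformer stream where long-range context matters.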