Dual-stream Transformer-GCN Model with Contextualized Representations Learning for Monocular 3D Human Pose Estimation
Mingrui Ye, Lianping Yang, Hegui Zhu, Zenghao Zheng, Xin Wang, Yantao Lo
TL;DR
The paper addresses monocular 3D human pose estimation (HPE) under depth ambiguity and limited 3D labels by introducing a Transformer-GCN dual-stream architecture that jointly models global spatial-temporal relations and local joint relations. It adds a contextualized representation learning (CRL) scheme based on masking and self-distillation with an exponential moving average (EMA) teacher, which enables pre-training on unlabeled 2D pose data. The method reports state-of-the-art results on Human3.6M (MPJPE ≈38.0 mm under Protocol 1, P-MPJPE ≈31.9 mm under Protocol 2) and MPI-INF-3DHP (MPJPE ≈15.9 mm), with strong PCK/AUC gains, and generalizes to in-the-wild sequences. By combining adaptive fusion of the Transformer and GCN streams with CRL pre-training, the approach offers a scalable, efficient path to accurate 3D pose estimation and extends to related motion tasks.
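For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, and P-MPJPE is the same quantity computed after rigidly aligning the prediction to the ground truth. The sketch below covers MPJPE only; the function name `mpjpe` and the nested-list layout `[frames][joints][3]` are illustrative assumptions, not the paper's evaluation code.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: nested lists shaped [frames][joints][3], in millimetres
    (this layout is an assumption made for the example).
    Returns the average Euclidean distance over all frames and joints.
    """
    total, count = 0.0, 0
    for frame_pred, frame_target in zip(pred, target):
        for joint_pred, joint_target in zip(frame_pred, frame_target):
            total += math.dist(joint_pred, joint_target)
            count += 1
    return total / count


# One frame, two joints: the first matches exactly, the second is off by 5 mm.
pred = [[[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]]
target = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
error = mpjpe(pred, target)  # (0 + 5) / 2 = 2.5 mm
```

A reported value such as 38.0 mm therefore means that, averaged over all evaluated joints and frames, each predicted joint lies about 38 mm from its ground-truth position.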
Abstract
This paper introduces a novel approach to monocular 3D human pose estimation using contextualized representation learning with the Transformer-GCN dual-stream model. Monocular 3D human pose estimation is challenged by depth ambiguity, limited 3D-labeled training data, imbalanced modeling, and restricted model generalization. To address these limitations, our work introduces a groundbreaking motion pre-training method based on contextualized representation learning. Specifically, our method involves masking 2D pose features and utilizing a Transformer-GCN dual-stream model to learn high-dimensional representations through a self-distillation setup. By focusing on contextualized representation learning and spatial-temporal modeling, our approach enhances the model's ability to understand spatial-temporal relationships between postures, resulting in superior generalization. Furthermore, leveraging the Transformer-GCN dual-stream model, our approach effectively balances global and local interactions in video pose estimation. The model adaptively integrates information from both the Transformer and GCN streams, where the GCN stream effectively learns local relationships between adjacent keypoints and frames, while the Transformer stream captures comprehensive global spatial and temporal features. Our model achieves state-of-the-art performance on two benchmark datasets, with an MPJPE of 38.0 mm and P-MPJPE of 31.9 mm on Human3.6M, and an MPJPE of 15.9 mm on MPI-INF-3DHP. Furthermore, visual experiments on public datasets and in-the-wild videos demonstrate the robustness and generalization capabilities of our approach.
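The three mechanisms named in the abstract (masking 2D pose inputs, an EMA teacher for self-distillation, and adaptive fusion of the two streams) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the zero-fill masking, the flat parameter list, and the softmax-weighted fusion are all hypothetical choices made for the example.

```python
import math
import random


def mask_joints(pose_seq, mask_ratio, rng):
    """Randomly hide 2D keypoints (assumed scheme: replace with zeros).

    pose_seq: nested list shaped [frames][joints][2].
    Returns (masked copy, boolean mask with True where a joint was hidden).
    """
    masked, mask = [], []
    for frame in pose_seq:
        masked_frame, frame_mask = [], []
        for joint in frame:
            hide = rng.random() < mask_ratio
            frame_mask.append(hide)
            masked_frame.append([0.0, 0.0] if hide else list(joint))
        masked.append(masked_frame)
        mask.append(frame_mask)
    return masked, mask


def ema_update(teacher, student, decay):
    """EMA teacher update: teacher <- decay * teacher + (1 - decay) * student.

    Parameters are represented as flat lists of floats for simplicity.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def adaptive_fuse(transformer_feat, gcn_feat, score_t, score_g):
    """Blend two feature vectors with softmax weights over two scores.

    The scores stand in for learned fusion logits (an assumption here).
    """
    exp_t, exp_g = math.exp(score_t), math.exp(score_g)
    w_t = exp_t / (exp_t + exp_g)
    w_g = exp_g / (exp_t + exp_g)
    return [w_t * a + w_g * b for a, b in zip(transformer_feat, gcn_feat)]


rng = random.Random(0)
poses = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]  # 2 frames, 2 joints
masked_poses, mask = mask_joints(poses, mask_ratio=0.5, rng=rng)
teacher = ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9)
fused = adaptive_fuse([2.0, 4.0], [0.0, 0.0], score_t=0.0, score_g=0.0)
```

In a self-distillation setup of this kind, the student would typically see the masked input while the teacher, whose weights trail the student's through `ema_update`, provides the target representations; how the paper defines the targets and the loss is not specified in the abstract.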
