
MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Zhenyu Zhang, Wenhao Chai, Zhongyu Jiang, Tian Ye, Mingli Song, Jenq-Neng Hwang, Gaoang Wang

TL;DR

MPM tackles the problem of estimating 3D human poses from 2D pose sequences by unifying 2D and 3D pose representations in a shared feature space using a single-stream transformer. It introduces masked pose modeling with two pretext tasks and a high masking ratio to capture spatial-temporal relations across modalities, enabling 3D HPE, occluded 2D-to-3D estimation, and pose completion within one framework. The approach achieves state-of-the-art performance on MPI-INF-3DHP and competitive results on other benchmarks, while maintaining relatively low computational cost. The work highlights the benefits of cross-modal pretraining and unified representations, and points to future directions such as scaling with larger data and incorporating additional modalities.

Abstract

Estimating 3D human poses from only a 2D human pose sequence has been thoroughly explored in recent years. Yet no prior work has attempted to unify 2D and 3D pose representations in a shared feature space. In this paper, we propose MPM, a unified 2D-3D human pose representation framework via masked pose modeling. We treat 2D and 3D poses as two different modalities, like vision and language, and build a single-stream transformer-based architecture. We apply two pretext tasks, masked 2D pose modeling and masked 3D pose modeling, to pre-train our network, and then fine-tune it with full supervision. A high total masking ratio of $71.8\%$, combined with a spatio-temporal mask sampling strategy, leads to better relation modeling in both the spatial and temporal domains. MPM can handle multiple tasks, including 3D human pose estimation, 3D pose estimation from occluded 2D poses, and 3D pose completion, in a single framework. We conduct extensive experiments and ablation studies on several widely used human pose datasets and achieve state-of-the-art performance on MPI-INF-3DHP.
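To make the idea of a total masking ratio under spatio-temporal sampling concrete, here is a minimal sketch. It assumes a two-step scheme that the abstract does not spell out: first hide every joint in a random fraction of frames (temporal masking), then hide a fixed number of joints in each remaining frame (spatial masking). Under that assumption the total ratio is $1 - (1 - r_t)(1 - k/J)$, where $r_t$ is the fraction of masked frames, $k$ the number of masked joints per visible frame, and $J$ the number of joints. The function names and the values $r_t = 0.6$, $k = 5$, $J = 17$ are illustrative assumptions chosen because they land close to the reported total; they are not settings confirmed by the text above.

```python
import random


def spatio_temporal_mask(num_frames, num_joints, frame_ratio, joints_per_frame, seed=0):
    """Return a num_frames x num_joints grid of booleans; True marks a masked joint.

    Hypothetical two-step scheme (an assumption, not the paper's confirmed recipe):
    whole frames are masked first, then a few joints in every remaining frame.
    """
    rng = random.Random(seed)
    mask = [[False] * num_joints for _ in range(num_frames)]

    # Temporal masking: hide every joint of a random subset of frames.
    num_masked_frames = round(frame_ratio * num_frames)
    masked_frames = set(rng.sample(range(num_frames), num_masked_frames))

    for t in range(num_frames):
        if t in masked_frames:
            mask[t] = [True] * num_joints
        else:
            # Spatial masking: hide a random subset of joints in a visible frame.
            for j in rng.sample(range(num_joints), joints_per_frame):
                mask[t][j] = True
    return mask


def masked_fraction(mask):
    """Fraction of (frame, joint) entries that are masked."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Illustrative values only: 243 frames, 17 joints, 60% of frames, 5 joints per frame.
mask = spatio_temporal_mask(num_frames=243, num_joints=17, frame_ratio=0.6, joints_per_frame=5)
total_ratio = masked_fraction(mask)
closed_form = 1 - (1 - 0.6) * (1 - 5 / 17)
print(f"sampled: {total_ratio:.4f}, closed form: {closed_form:.4f}")
```

The sampled ratio differs slightly from the closed form because the number of masked frames is rounded to an integer. Which entries are masked depends on the seed, but the count of masked entries does not.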

Paper Structure (36 sections, 11 equations, 8 figures, 10 tables)

This paper contains 36 sections, 11 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Comparison of existing 3D human pose estimation lifting methods and our MPM. (a) End-to-end training without pre-training, using any backbone; (b) pre-training with masked 2D pose modeling, followed by fine-tuning (shan2022p); (c) our MPM: pre-trained with both masked 2D and masked 3D pose modeling, with most of the parameters in shared layers that learn a unified representation.
  • Figure 2: We train the proposed network in two stages. In stage I, we apply two pretext tasks: (a) masked 2D pose modeling and (b) masked 3D pose modeling. After pre-training on these pretext tasks, the model has learned prior knowledge of the spatial and temporal relations of both 2D and 3D human poses. In stage II, we use unmasked 2D-3D pose pairs to fine-tune the network for 3D HPE, or partial 2D-3D or partial 3D-3D pairs to fine-tune it for pose completion.
  • Figure 3: Detailed architecture of the shared transformer layers. The tensor sizes shown in the pipeline assume an input of $243$ frames.
  • Figure 4: Illustration of the spatio-temporal mask sampling strategy.
  • Figure 5: Qualitative comparison with a previous method (zheng20213d) and the ground truth on the Human3.6M benchmark.
  • ...and 3 more figures