MPL: Lifting 3D Human Pose from Multi-view 2D Poses

Seyed Abolfazl Ghasemzadeh, Alexandre Alahi, Christophe De Vleeschouwer

TL;DR

This work addresses 3D human pose recovery from multi-view 2D data under real-world variability by decoupling 2D keypoint detection from 3D pose lifting. It introduces MPL, a transformer-based lifter that fuses per-view 2D skeletons through a Spatial Pose Transformer and a Fusion Pose Transformer to output 3D poses in world coordinates. MPL is trained on synthetic 2D-3D pairs produced by the Mesh-based Pose Dataset Generator from AMASS meshes. The approach reduces MPJPE by up to about 45% relative to triangulation (with two views on Human3.6M), remains robust as the number of views changes, and supports real-time inference. Because training relies on the mesh-based generator and synthetic data, the lifter can be trained for arbitrary camera setups, which broadens the applicability of 3D pose estimation in unconstrained environments. The code is publicly available.

Abstract

Estimating 3D human poses from 2D images is challenging due to occlusions and projective acquisition. Learning-based approaches have been widely studied to address this challenge, in both single-view and multi-view setups. These solutions, however, fail to generalize to real-world cases because of the lack of (multi-view) 'in-the-wild' images paired with 3D poses for training. For this reason, we propose combining 2D pose estimation, for which large and rich training datasets exist, with 2D-to-3D pose lifting, using a transformer-based network that can be trained from synthetic 2D-3D pose pairs. Our experiments demonstrate reductions of up to 45% in MPJPE compared to the 3D pose obtained by triangulating the 2D poses. The framework's source code is available at https://github.com/aghasemzadeh/OpenMPL.
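
For readers unfamiliar with the baseline and the metric mentioned above, the following is a minimal sketch of linear (DLT) triangulation of 2D poses and of MPJPE (the mean Euclidean distance between predicted and ground-truth joints). It is not taken from the OpenMPL code: the function names, the two-camera setup, the 17-joint skeleton, and the Gaussian pixel noise are all illustrative assumptions, and the authors' triangulation baseline may differ in its details.

```python
import numpy as np

def project(P, X):
    """Project 3D points X (J, 3) with a 3x4 camera matrix P; returns (J, 2) pixels."""
    Xh = np.hstack([X, np.ones((len(X), 1))])  # homogeneous coordinates
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]

def triangulate_dlt(Ps, xs):
    """Linear (DLT) triangulation of one joint seen in V views.

    Ps: sequence of V camera matrices (3x4); xs: (V, 2) pixel coordinates.
    """
    rows = []
    for P, (u, v) in zip(Ps, xs):
        rows.append(u * P[2] - P[0])  # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.stack(rows))
    Xh = Vt[-1]                       # right singular vector of the smallest singular value
    return Xh[:3] / Xh[3]

def triangulate_pose(Ps, poses_2d):
    """Triangulate every joint independently; poses_2d has shape (V, J, 2)."""
    n_joints = poses_2d.shape[1]
    return np.stack([triangulate_dlt(Ps, poses_2d[:, j]) for j in range(n_joints)])

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the inputs."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Hypothetical two-camera rig (units: meters), both cameras looking along +z.
K = np.array([[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # 1 m baseline
Ps = [P1, P2]

rng = np.random.default_rng(0)
gt = rng.normal(0.0, 0.3, (17, 3)) + np.array([0.0, 0.0, 4.0])  # toy 17-joint "pose"

clean_2d = np.stack([project(P, gt) for P in Ps])           # (2, 17, 2)
noisy_2d = clean_2d + rng.normal(0.0, 5.0, clean_2d.shape)  # assumed 5 px detector noise

clean_err = mpjpe(triangulate_pose(Ps, clean_2d), gt)
noisy_err = mpjpe(triangulate_pose(Ps, noisy_2d), gt)
print(f"MPJPE, clean 2D: {clean_err:.2e} m; noisy 2D: {noisy_err:.3f} m")
```

With exact 2D inputs the triangulated pose matches the ground truth up to numerical precision. With only two views, small 2D errors translate into noticeably larger 3D errors, which is the regime where the abstract reports the largest gain for the learned lifter.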

Paper Structure (25 sections, 1 equation, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison between the state of the art and our MPL. Left: the approach used by most prior works, which takes images as input and directly predicts a 3D pose. Right: our approach, which splits the process into two stages: 1) 2D pose estimation on each view, and 2) fusion of the information from all views to predict a 3D pose. As a key advantage, MPL can be trained for arbitrary acquisition setups using synthetic 2D-3D skeleton pairs. In contrast, prior methods rely on image-to-3D-pose correspondences, which are only available for specific scenes and acquisition conditions, and this limits the generalization of the trained models.
  • Figure 2: MPL takes 2D poses from all views as inputs, encodes them independently, and fuses them using a transformer network.
  • Figure 3: Our Mesh-based Human Pose Dataset Generator (MHP) takes 3D mesh vertices and camera calibration matrices as inputs. It randomly positions the mesh in the scene and returns the 3D pose ground truth and one noisy 2D pose per camera view. The noise associated with the 2D poses comes from the fact that each 2D pose is estimated by an off-the-shelf 2D pose estimation model, applied to the image of the mesh rendered in the corresponding view. The 3D keypoint regressor defines the 3D pose ground truth in a way that is consistent with the definition of ground truth at testing. This makes the 2D-3D pairs of human pose appropriate for training an accurate and robust MPL, able to turn the 2D poses computed by the off-the-shelf pose estimator into a 3D skeleton that is consistent with the 3D keypoints expected at inference.
  • Figure 4: Example of inconsistency between the Human3.6M and COCO formats. The triangles depict the 3D ground-truth keypoints projected to 2D space (Human3.6M format), and the circles depict the keypoints predicted by the off-the-shelf 2D pose estimator (COCO format). The green keypoints are defined similarly in both formats, and the red ones have distinct definitions in the two formats. Note that some of the predicted keypoints are obtained by averaging the locations of a subset of keypoints predicted by the 2D pose estimator (in COCO format), e.g., the torso keypoint is placed midway between the pelvis and the neck. We do not include these averaged keypoints in the set KP* of keypoints with similar definitions, even if their location is reasonably close to the one adopted in Human3.6M.
  • Figure 5: Qualitative results. In the top row, GT and MMPose 2D denote the ground truth taken from the CMU dataset and the 2D pose predicted by the off-the-shelf 2D pose estimator, respectively. In the bottom row, GT shows the 3D ground truth from the CMU dataset, and the other two 3D poses are the outputs of triangulation and MPL when the 2D poses in the top row are given as inputs.
  • ...and 1 more figures
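
The keypoint averaging mentioned in the Figure 4 caption can be illustrated with a short sketch. Only the torso rule (midway between pelvis and neck) comes from the caption. Deriving the pelvis as the mid-hip point and the neck as the mid-shoulder point is an assumption made here because COCO has no such keypoints, and the helper name `derive_center_keypoints` is hypothetical. The indices follow the standard COCO 17-keypoint order.

```python
import numpy as np

# Standard COCO-17 keypoint indices for the joints used below.
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12

def derive_center_keypoints(coco_kpts):
    """Derive averaged keypoints from a (17, 2) array of COCO-format 2D keypoints."""
    pelvis = 0.5 * (coco_kpts[L_HIP] + coco_kpts[R_HIP])            # assumption: mid-hip
    neck = 0.5 * (coco_kpts[L_SHOULDER] + coco_kpts[R_SHOULDER])    # assumption: mid-shoulder
    torso = 0.5 * (pelvis + neck)                                   # rule stated in the Figure 4 caption
    return {"pelvis": pelvis, "neck": neck, "torso": torso}

# Toy detection: only the four joints used above are filled in.
kpts = np.zeros((17, 2))
kpts[L_SHOULDER], kpts[R_SHOULDER] = (90.0, 100.0), (110.0, 100.0)
kpts[L_HIP], kpts[R_HIP] = (95.0, 200.0), (105.0, 200.0)

derived = derive_center_keypoints(kpts)
print(derived["torso"])  # midpoint of pelvis (100, 200) and neck (100, 100)
```

Keypoints built this way only approximate the Human3.6M definitions, which is consistent with the caption's choice to keep them out of KP*.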