MPL: Lifting 3D Human Pose from Multi-view 2D Poses
Seyed Abolfazl Ghasemzadeh, Alexandre Alahi, Christophe De Vleeschouwer
TL;DR
This work tackles the challenge of recovering 3D human pose from multi-view 2D data under real-world variability by decoupling 2D keypoint detection from 3D pose lifting. It introduces MPL, a transformer-based lifter that fuses per-view 2D skeletons via a Spatial Pose Transformer and a Fusion Pose Transformer to output 3D poses in world coordinates. MPL is trained on synthetic 2D-3D pairs that the Mesh-based Pose Dataset Generator produces from AMASS meshes. The approach reduces MPJPE by up to about 45% relative to triangulation with two views on Human3.6M, remains robust as the number of views changes, and supports real-time inference. Because training relies on synthetic data from the mesh-based generator, the lifter can be deployed across arbitrary camera setups, which broadens the applicability of 3D pose estimation in unconstrained environments. The code is publicly available.
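To make the idea of a synthetic 2D-3D training pair concrete, the following is a minimal sketch, not the authors' Mesh-based Pose Dataset Generator. It assumes an ideal pinhole camera without lens distortion and skips the mesh step entirely, starting from 3D joints that are already given. The function names `project_points` and `make_training_pair`, the camera parameters, and the toy joints are all hypothetical and chosen for illustration only.

```python
import numpy as np

def project_points(joints_world, K, R, t):
    """Project (J, 3) world-coordinate joints to (J, 2) pixel coordinates.

    Assumes an ideal pinhole camera: K is the 3x3 intrinsic matrix,
    R (3x3) and t (3,) map world coordinates to the camera frame.
    """
    cam = joints_world @ R.T + t        # world frame -> camera frame
    uvw = cam @ K.T                     # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide

def make_training_pair(joints_world, cameras):
    """Return (per-view 2D poses of shape (V, J, 2), 3D pose of shape (J, 3))."""
    poses_2d = np.stack([project_points(joints_world, K, R, t)
                         for K, R, t in cameras])
    return poses_2d, joints_world

# Hypothetical two-camera rig and a two-joint toy "skeleton" (metres).
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
cameras = [(K, R, np.array([0.0, 0.0, 4.0])),
           (K, R, np.array([0.5, 0.0, 4.0]))]
joints = np.array([[0.0, 0.0, 0.0],
                   [0.2, -0.1, 0.1]])

poses_2d, pose_3d = make_training_pair(joints, cameras)
print(poses_2d.shape)  # one 2D skeleton per view
```

Because the pair is produced by projection alone, the same 3D pose can be re-projected into any camera configuration, which is what makes this style of training data independent of a particular capture setup.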
Abstract
Estimating 3D human poses from 2D images is challenging due to occlusions and projective acquisition. Learning-based approaches have been extensively studied to address this challenge, in both single-view and multi-view setups. These solutions, however, fail to generalize to real-world cases due to the lack of (multi-view) 'in-the-wild' images paired with 3D poses for training. For this reason, we propose combining 2D pose estimation, for which large and rich training datasets exist, with 2D-to-3D pose lifting, using a transformer-based network that can be trained from synthetic 2D-3D pose pairs. Our experiments demonstrate reductions of up to 45% in MPJPE compared to the 3D pose obtained by triangulating the 2D poses. The framework's source code is available at https://github.com/aghasemzadeh/OpenMPL.
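For readers unfamiliar with the baseline and the metric, the sketch below shows one common way to triangulate a joint from several views (the linear Direct Linear Transform, DLT) and to compute MPJPE, the mean Euclidean distance between predicted and ground-truth joints. It is an illustration under assumptions: the abstract does not state which triangulation variant is used, and the function names, projection matrices, and toy joints here are hypothetical.

```python
import numpy as np

def triangulate_dlt(points_2d, proj_mats):
    """Triangulate one 3D point from V views with the linear DLT method.

    points_2d: (V, 2) pixel coordinates; proj_mats: (V, 3, 4) matrices P = K [R | t].
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])    # constraint from the x pixel coordinate
        rows.append(v * P[2] - P[1])    # constraint from the y pixel coordinate
    A = np.stack(rows)                  # (2V, 4) homogeneous system A X = 0
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                          # right singular vector of smallest singular value
    return X[:3] / X[3]                 # dehomogenize

def mpjpe(pred, gt):
    """Mean per-joint position error between (J, 3) arrays, in the input units."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Hypothetical two-view rig: shared intrinsics, second camera shifted along x.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
Rt1 = np.hstack([np.eye(3), [[0.0], [0.0], [4.0]]])
Rt2 = np.hstack([np.eye(3), [[0.5], [0.0], [4.0]]])
proj_mats = np.stack([K @ Rt1, K @ Rt2])

gt = np.array([[0.0, 0.0, 0.0],
               [0.2, -0.1, 0.1]])                  # toy ground-truth joints (metres)
gt_h = np.hstack([gt, np.ones((len(gt), 1))])      # homogeneous coordinates
proj = np.einsum('vij,nj->vni', proj_mats, gt_h)   # (V, J, 3) homogeneous pixels
obs_2d = proj[..., :2] / proj[..., 2:3]            # noise-free 2D detections

pred = np.stack([triangulate_dlt(obs_2d[:, j], proj_mats) for j in range(len(gt))])
error = mpjpe(pred, gt)
```

With noise-free 2D inputs the triangulated joints coincide with the ground truth up to numerical precision. In practice the 2D detections are noisy or occluded, and triangulation errors grow accordingly, especially with few views.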
