Limitations of (Procrustes) Alignment in Assessing Multi-Person Human Pose and Shape Estimation
Drazic Martin, Pierre Perrault
TL;DR
This paper tackles the challenge of multi-person 3D pose and shape estimation in stationary-camera surveillance by stressing the importance of global world-coordinate consistency. It introduces RotAvat, a training-free post-processing pipeline with an auto-calibration step and a ground-plane alignment transform that repositions predicted 3D meshes without altering the camera view. The authors show that existing methods struggle with global translation and ground alignment, and they demonstrate qualitative improvements over BEV, SPEC, and CLIFF through RotAvat’s alignment to the ground plane. The work highlights the practical impact of enforcing world-ground consistency for reliable surveillance-based 3D scene understanding and supports evaluation using world-coordinate metrics such as W-MPJPE and W-PVE.
Abstract
We examine the challenges of accurately estimating 3D human pose and shape in video surveillance scenarios. We first advocate for metrics such as W-MPJPE and W-PVE, which omit the (Procrustes) realignment step, as a better basis for model evaluation, and then introduce RotAvat. This technique aims to improve these metrics by refining the alignment of 3D meshes with the ground plane. Through qualitative comparisons, we demonstrate RotAvat's effectiveness in addressing the limitations of existing approaches.
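To illustrate why omitting the Procrustes step matters, the following is a minimal sketch, not the authors' code. It assumes W-MPJPE is the mean per-joint Euclidean error computed directly in world coordinates, and it uses a standard similarity Procrustes fit (SVD-based, with reflection correction) for the aligned variant. The function names `w_mpjpe` and `pa_mpjpe`, the five-joint skeleton, and the tilt and offset values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def w_mpjpe(pred, gt):
    """Mean per-joint error in world coordinates, with no realignment."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def pa_mpjpe(pred, gt):
    """Mean per-joint error after a similarity Procrustes fit of pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the cross-covariance gives the optimal rotation (row-vector convention).
    u, s, vt = np.linalg.svd(p.T @ g)
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0  # avoid reflections
    fix = np.diag([1.0, 1.0, d])
    rot = u @ fix @ vt
    scale = float((s * np.diag(fix)).sum() / (p ** 2).sum())
    aligned = scale * p @ rot + mu_g
    return float(np.linalg.norm(aligned - gt, axis=1).mean())

# Hypothetical 5-joint skeleton in metres (y is up), standing on the ground plane.
gt = np.array([
    [0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0],
    [0.0, 1.5, 0.0],
    [0.2, 1.4, 0.1],
    [-0.2, 1.4, -0.1],
])

# Assumed failure mode: the prediction has the right pose but is tilted 20 degrees
# relative to the ground plane and shifted 0.5 m sideways in the world frame.
theta = np.deg2rad(20.0)
tilt = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(theta), -np.sin(theta)],
    [0.0, np.sin(theta), np.cos(theta)],
])
pred = gt @ tilt.T + np.array([0.5, 0.0, 0.0])

print("W-MPJPE :", w_mpjpe(pred, gt))   # penalises the tilt and the offset
print("PA-MPJPE:", pa_mpjpe(pred, gt))  # alignment absorbs both, error near zero
```

In this toy case the Procrustes-aligned score is essentially zero even though every joint is at least half a metre from its true world position, which is the kind of error a world-coordinate metric is meant to expose.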
