Limitations of (Procrustes) Alignment in Assessing Multi-Person Human Pose and Shape Estimation

Drazic Martin; Pierre Perrault

TL;DR

This paper tackles the challenge of multi-person 3D pose and shape estimation in stationary-camera surveillance by stressing the importance of global world-coordinate consistency. It introduces RotAvat, a training-free post-processing pipeline with an auto-calibration step and a ground-plane alignment transform that repositions predicted 3D meshes without altering the camera view. The authors show that existing methods struggle with global translation and ground alignment, and they demonstrate qualitative improvements over BEV, SPEC, and CLIFF through RotAvat’s alignment to the ground plane. The work highlights the practical impact of enforcing world-ground consistency for reliable surveillance-based 3D scene understanding and supports evaluation using world-coordinate metrics such as W-MPJPE and W-PVE.

Abstract

We delve into the challenges of accurately estimating 3D human pose and shape in video surveillance scenarios. We first advocate for metrics such as W-MPJPE and W-PVE, which omit the (Procrustes) realignment step, to improve model evaluation. We then introduce RotAvat, a technique that aims to improve these metrics by refining the alignment of 3D meshes with the ground plane. Through qualitative comparisons, we demonstrate RotAvat's effectiveness in addressing the limitations of existing approaches.
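To illustrate why omitting the Procrustes step matters, here is a minimal sketch in Python with NumPy. It assumes W-MPJPE is the mean per-joint Euclidean distance computed directly in world coordinates, and that the Procrustes-aligned variant first fits a similarity transform (scale, rotation, translation) from prediction to ground truth. The function names, the random 24-joint skeleton, and the 0.5 m offset are hypothetical and not taken from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding joints (no alignment)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Fit scale, rotation and translation mapping pred onto gt (least squares)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance gives the optimal rotation.
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

# Hypothetical data: a correct pose that floats 0.5 m above its true position.
rng = np.random.default_rng(0)
gt = rng.normal(size=(24, 3))
pred = gt + np.array([0.0, 0.5, 0.0])

w_mpjpe = mpjpe(pred, gt)                         # world error, no realignment
pa_mpjpe = mpjpe(procrustes_align(pred, gt), gt)  # error after Procrustes
print(f"W-MPJPE: {w_mpjpe:.3f}  PA-MPJPE: {pa_mpjpe:.3f}")
```

In this toy case the aligned error is essentially zero, while the world-coordinate error equals the full 0.5 m offset. A placement error of this kind is invisible to an aligned metric and visible to a world-coordinate one.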

Paper Structure (11 sections, 3 equations, 7 figures)

Introduction
Related work
Issues with existing methods
Other recent related approaches
Metrics
Our approach and how it compares with existing solutions
Qualitative comparison
Our approach
Auto-calibration
RotAvat
Conclusion

Figures (7)

Figure 1: Comparison between BEV, SPEC, CLIFF and our proposed method, front and side views. Note that we rendered the side view with an orthographic projection to make the elevation of the meshes relative to the ground easier to see.
Figure 2: Further comparison between BEV, SPEC, CLIFF and our proposed method, front and side views.
Figure 3: Further comparison between BEV, SPEC, CLIFF and our proposed method, front and side views.
Figure 4: Further comparison between BEV, SPEC, CLIFF and our proposed method, front and side views.
Figure 5: An example of a pedestrian with the foot-head point pair, with an example of camera calibration parameters we want to regress. For this choice of calibration parameters, the projection matrix of the camera is defined as $P=\begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f\cos(\theta) & -f\sin(\theta) & -fc\cos(\theta) \\ 0 & \sin(\theta) & \cos(\theta) & -c\sin(\theta) \\ 0 & 0 & 0 & 1 \end{pmatrix}$.
...and 2 more figures
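
The projection matrix in the Figure 5 caption can be explored with a short Python sketch. It reads the sixteen entries of $P$ as a 4×4 matrix in row-major order and makes several assumptions the caption does not state: $f$ is a focal length in pixels, $\theta$ a tilt angle in radians, $c$ the camera height along the Y axis, and image coordinates are obtained by dividing the first two components of $P\,(X, Y, Z, 1)^\top$ by the third. The function names and the numeric values are hypothetical.

```python
import math

def projection_matrix(f, theta, c):
    """4x4 matrix P from the Figure 5 caption (parameter meanings are assumed)."""
    ct, st = math.cos(theta), math.sin(theta)
    return [
        [f,   0.0,    0.0,     0.0],
        [0.0, f * ct, -f * st, -f * c * ct],
        [0.0, st,     ct,      -c * st],
        [0.0, 0.0,    0.0,     1.0],
    ]

def project(P, point):
    """Apply P to a homogeneous 3D point, then divide by the third component."""
    X, Y, Z = point
    h = (X, Y, Z, 1.0)
    x, y, depth, _ = (sum(P[r][k] * h[k] for k in range(4)) for r in range(4))
    return x / depth, y / depth

# Hypothetical pedestrian: foot on the ground (Y = 0), head 1.7 m above it.
P = projection_matrix(f=1000.0, theta=math.radians(10.0), c=3.0)
foot = project(P, (1.0, 0.0, 10.0))
head = project(P, (1.0, 1.7, 10.0))
print("foot:", foot, "head:", head)
```

Under this reading, rows two and three rotate the point $(X, Y - c, Z)$ about the X axis by $\theta$, so the matrix behaves like a pinhole camera placed at height $c$ and tilted by $\theta$. Each foot-head pair then gives two image points that depend only on $f$, $\theta$ and $c$.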
