Limitations of (Procrustes) Alignment in Assessing Multi-Person Human Pose and Shape Estimation

Drazic Martin, Pierre Perrault

TL;DR

This paper tackles the challenge of multi-person 3D pose and shape estimation in stationary-camera surveillance by stressing the importance of global world-coordinate consistency. It introduces RotAvat, a training-free post-processing pipeline with an auto-calibration step and a ground-plane alignment transform that repositions predicted 3D meshes without altering the camera view. The authors show that existing methods struggle with global translation and ground alignment, and they demonstrate qualitative improvements over BEV, SPEC, and CLIFF through RotAvat’s alignment to the ground plane. The work highlights the practical impact of enforcing world-ground consistency for reliable surveillance-based 3D scene understanding and supports evaluation using world-coordinate metrics such as W-MPJPE and W-PVE.

Abstract

We examine the challenges of accurately estimating 3D human pose and shape in video surveillance scenarios. We first advocate metrics such as W-MPJPE and W-PVE, which omit the (Procrustes) realignment step, to improve model evaluation, and then introduce RotAvat. This technique aims to improve these metrics by refining the alignment of 3D meshes with the ground plane. Through qualitative comparisons, we demonstrate RotAvat's effectiveness in addressing the limitations of existing approaches.
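
To make the role of the realignment step concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes W-MPJPE means the plain mean per-joint error computed in world coordinates with no alignment, and it compares that with a Procrustes-aligned error. The function names (`mpjpe`, `procrustes_align`), the 17-joint layout, and the synthetic data are all hypothetical choices for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error between two (J, 3) joint arrays."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def procrustes_align(pred, gt):
    """Similarity-align pred to gt (scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    fix = np.diag([1.0, 1.0, d])
    rot = u @ fix @ vt
    scale = (s * np.diag(fix)).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g

# Synthetic example (assumed data): the prediction is the ground truth,
# tilted by 10 degrees and lifted 0.5 m above where it should be.
rng = np.random.default_rng(0)
gt = rng.normal(scale=0.3, size=(17, 3))
a = np.deg2rad(10.0)
tilt = np.array([[1.0, 0.0, 0.0],
                 [0.0, np.cos(a), np.sin(a)],
                 [0.0, -np.sin(a), np.cos(a)]])
pred = gt @ tilt + np.array([0.0, 0.5, 0.0])

w_err = mpjpe(pred, gt)                         # no alignment
pa_err = mpjpe(procrustes_align(pred, gt), gt)  # after Procrustes alignment
print(f"world-coordinate error: {w_err:.3f} m")
print(f"Procrustes-aligned error: {pa_err:.3e} m")
```

In this toy case the aligned error is numerically zero while the world-coordinate error stays near the 0.5 m offset, which is the kind of global translation and tilt error that a Procrustes-aligned metric cannot reveal.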

Paper Structure

This paper contains 11 sections, 3 equations, and 7 figures.

Figures (7)

  • Figure 1: Comparison between BEV, SPEC, CLIFF and our proposed method, front and side views. Note that we rendered the side view with an orthographic projection to better show the elevation of the meshes relative to the ground.
  • Figure 2: Further comparisons between BEV, SPEC, CLIFF and our proposed method, front and side views.
  • Figure 3: Further comparisons between BEV, SPEC, CLIFF and our proposed method, front and side views.
  • Figure 4: Further comparisons between BEV, SPEC, CLIFF and our proposed method, front and side views.
  • Figure 5: An example of a pedestrian with the foot-head point pair, with an example of camera calibration parameters we want to regress. For this choice of calibration parameters, the projection matrix of the camera is defined as $P=\begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f\cos(\theta) & -f\sin(\theta) & -fc\cos(\theta) \\ 0 & \sin(\theta) & \cos(\theta) & -c\sin(\theta) \\ 0 & 0 & 0 & 1 \end{pmatrix}$.
  • ...and 2 more figures
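
The projection matrix in the Figure 5 caption can be exercised numerically. The sketch below reads the caption's entries as a 4×4 matrix in row-major order and makes several assumptions that the caption does not state: $f$ is a focal length in pixels, $\theta$ is the camera tilt, $c$ is the camera height above the ground, the world Y axis is vertical with the ground at Y = 0, and image coordinates are taken relative to the principal point. The helper names and the numeric values are hypothetical.

```python
import numpy as np

def projection_matrix(f, theta, c):
    """4x4 matrix built from the entries given in the Figure 5 caption."""
    ct, st = np.cos(theta), np.sin(theta)
    return np.array([
        [f,   0.0,    0.0,     0.0],
        [0.0, f * ct, -f * st, -f * c * ct],
        [0.0, st,     ct,      -c * st],
        [0.0, 0.0,    0.0,     1.0],
    ])

def project(P, point):
    """Project a 3D point (X, Y, Z) to 2D by dividing by the third row."""
    x, y, w, _ = P @ np.append(point, 1.0)
    return np.array([x / w, y / w])

# Assumed example values: 1000 px focal length, 20 degree tilt, 5 m height.
P = projection_matrix(f=1000.0, theta=np.deg2rad(20.0), c=5.0)

# Foot-head pair of a pedestrian of assumed height 1.7 m, 10 m from the camera.
foot = project(P, np.array([1.0, 0.0, 10.0]))
head = project(P, np.array([1.0, 1.7, 10.0]))
print("foot:", foot)
print("head:", head)
```

Under these assumptions, each foot-head pair yields two image points whose positions depend only on $f$, $\theta$, and $c$, which is what makes such pairs usable as observations when regressing the calibration parameters.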