ManiPose: Manifold-Constrained Multi-Hypothesis 3D Human Pose Estimation

Cédric Rommel, Victor Letzelter, Nermin Samet, Renaud Marlet, Matthieu Cord, Patrick Pérez, Eduardo Valle

TL;DR

ManiPose tackles depth ambiguity in monocular 3D human pose estimation by combining multi-hypothesis lifting with a pose manifold constraint, producing multiple plausible 3D poses per 2D input while ensuring the poses lie on a consistent, rigid-skeleton manifold. It employs two disentangled modules for segment lengths and joint rotations, uses 6D rotation representations, and decodes via forward kinematics, all within a winner-takes-all plus score-based multi-choice learning framework. The work provides formal arguments showing that single-hypothesis regression cannot simultaneously minimize MPJPE and guarantee pose consistency, and demonstrates that a small number of hypotheses on the pose manifold achieves superior pose consistency with competitive MPJPE on Human3.6M and MPI-INF-3DHP. Empirically, ManiPose outperforms state-of-the-art methods in pose consistency by a large margin while remaining highly competitive on conventional accuracy metrics, suggesting substantial practical gains for stable, plausible 3D pose reconstructions in real-world video. The combination of theoretical insights, ablations, and strong cross-dataset results highlights the importance of modeling depth ambiguity with manifold-constrained multi-hypothesis predictions in 3D human pose estimation.
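The decoding step mentioned above (6D rotations followed by forward kinematics) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `parents` array, the rest-pose directions `rest_dirs`, and the choice of parent-relative rotations are all assumptions made here. The 6D-to-matrix mapping uses Gram-Schmidt orthonormalization, the usual construction for 6D rotation representations.

```python
import numpy as np

def rot6d_to_matrix(x6):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = x6[:3], x6[3:]
    b1 = a1 / np.linalg.norm(a1)
    # Remove the component of a2 along b1, then normalize.
    a2_orth = a2 - np.dot(b1, a2) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    # The third axis completes a right-handed orthonormal frame.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)  # columns are b1, b2, b3

def forward_kinematics(parents, rest_dirs, lengths, rots6d):
    """Place joints from the root outwards (hypothetical interface).

    parents[j]   : index of the parent of joint j (-1 for the root);
                   joints are assumed ordered so parents come first.
    rest_dirs[j] : unit direction of segment (parent -> j) in a rest pose.
    lengths[j]   : length of segment (parent -> j); unused for the root.
    rots6d[j]    : 6D rotation of joint j relative to its parent (assumed).
    """
    num_joints = len(parents)
    joints = np.zeros((num_joints, 3))       # root stays at the origin
    global_rot = [np.eye(3)] * num_joints
    for j in range(num_joints):
        local = rot6d_to_matrix(rots6d[j])
        p = parents[j]
        if p < 0:
            global_rot[j] = local
            continue
        # Compose rotations along the kinematic chain.
        global_rot[j] = global_rot[p] @ local
        # Rotated unit direction scaled by the segment length.
        joints[j] = joints[p] + lengths[j] * (global_rot[j] @ rest_dirs[j])
    return joints
```

Because every segment is a rotated unit vector scaled by its length, any predicted rotation yields a pose with exactly the requested segment lengths. This is the sense in which a forward-kinematics decoder keeps predictions on the manifold defined by those lengths.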

Abstract

We propose ManiPose, a manifold-constrained multi-hypothesis model for human-pose 2D-to-3D lifting. We provide theoretical and empirical evidence that, due to the depth ambiguity inherent to monocular 3D human pose estimation, traditional regression models suffer from pose-topology consistency issues, which standard evaluation metrics (MPJPE, P-MPJPE and PCK) fail to assess. ManiPose addresses depth ambiguity by proposing multiple candidate 3D poses for each 2D input, each with its estimated plausibility. Unlike previous multi-hypothesis approaches, ManiPose forgoes generative models, greatly facilitating its training and usage. By constraining the outputs to lie on the human pose manifold, ManiPose guarantees the consistency of all hypothetical poses, in contrast to previous works. We showcase the performance of ManiPose on real-world datasets, where it outperforms state-of-the-art models in pose consistency by a large margin while being very competitive on the MPJPE metric.
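To make the idea of "multiple candidate poses, each with its estimated plausibility" more concrete, here is a hedged sketch of a winner-takes-all loss with a score term. The exact loss used by ManiPose is not given in this summary, so the function name, the `beta` weight, and the use of a categorical cross-entropy on the winning hypothesis are assumptions chosen for illustration.

```python
import numpy as np

def wta_with_score_loss(hyps, scores, target, beta=1.0):
    """Winner-takes-all loss plus a score term (illustrative only).

    hyps   : (K, J, 3) array of K hypothesized 3D poses.
    scores : (K,) array of predicted probabilities summing to 1.
    target : (J, 3) ground-truth 3D pose.
    beta   : weight of the score term (hypothetical hyperparameter).
    """
    # Mean per-joint position error of each hypothesis.
    errors = np.linalg.norm(hyps - target, axis=-1).mean(axis=-1)  # (K,)
    winner = int(np.argmin(errors))
    # Only the best hypothesis receives the regression penalty.
    wta = errors[winner]
    # Cross-entropy pushing the winner's score towards 1 (assumed form).
    score_loss = -np.log(np.clip(scores[winner], 1e-12, 1.0))
    return wta + beta * score_loss, winner
```

Penalizing only the closest hypothesis lets the others specialize on different modes of the depth ambiguity, while the score term teaches the model which hypothesis is most likely to be the good one.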

Paper Structure (30 sections, 6 theorems, 31 equations, 11 figures, 10 tables, 2 algorithms)

Key Result

Proposition 4.1

Assuming a rigid skeleton, all poses of a movement $\mathrm{m}=[\mathrm{p}_t]_{t=1}^T$ lie on a manifold $\mathcal{M}$ of dimension $2(J-1)$.
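
As a reading aid (not the paper's own statement or proof), the dimension $2(J-1)$ can be recovered by a simple counting argument. Assume the root joint is fixed at the origin, and introduce for illustration a parent map $\tau$, constant segment lengths $s_j$, and unit directions $\mathrm{d}_{t,j}$; none of these symbols appear in the text above. Each non-root joint is then placed relative to its parent:

$$
\mathrm{p}_{t,j} = \mathrm{p}_{t,\tau(j)} + s_j\,\mathrm{d}_{t,j},
\qquad \|\mathrm{d}_{t,j}\|_2 = 1,
\qquad j = 1,\dots,J-1 .
$$

With the lengths $s_j$ fixed, a pose is determined by the $J-1$ directions $(\mathrm{d}_{t,1},\dots,\mathrm{d}_{t,J-1})$, each living on the unit sphere $S^2$, which has dimension 2. Under these assumptions the set of reachable poses is parameterized by $(S^2)^{J-1}$, giving

$$
\dim \mathcal{M} = (J-1)\cdot \dim S^2 = 2(J-1).
$$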

Figures (11)

  • Figure 1: Optimizing both 3D position and pose consistency requires combining constraints and multiple hypotheses. Results are taken from the paper's consistency and ablation tables. Previous unconstrained methods provide inconsistent poses (top). Regularization (MR) and disentanglement constraints improve consistency, but degrade joint position error (bottom-right). Ours is the only method that achieves both good joint error and consistency, thanks to a combination of disentanglement and a few hypotheses (see circle sizes).
  • Figure 2: Overview of ManiPose. The rotations module predicts $K$ possible sequences of segment rotations with their corresponding likelihoods (scores), while the segments module estimates the shared segment lengths. Hence, predicted poses are constrained to a manifold defined by the estimated lengths, guaranteeing their consistency.
  • Figure 3: Pose decoder overview.
  • Figure 4: (A) 1D-to-2D articulated pose lifting problem. (B) True MSE minimizers under a multimodal distribution. One-to-one mappings cannot both reach optimal performance and stay on the pose manifold (dashed circle). (C) Without depth ambiguity, unconstrained models are effective. (D) Ambiguity from multimodal distributions challenges both constrained and unconstrained models. Multi-hypothesis approaches can deliver an acceptable solution to the problem.
  • Figure 5: MPSCE, MPSSE and MPJPE per segment/coordinate (lower is better). ManiPose mostly helps to deal with the depth ambiguity ($z$ coordinate). Ground-truth poses are represented but not visible because they have perfect consistency.
  • ...and 6 more figures
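
The 1D-to-2D lifting problem of Figure 4 can be reproduced numerically. The sketch below assumes a single unit-length segment rooted at the origin and a bimodal distribution in which the hidden coordinate is $+y$ or $-y$; the function name and the `p_up` parameter are illustrative choices, not taken from the paper.

```python
import numpy as np

def conditional_mean_lift(x, p_up=0.5):
    """MSE-optimal single prediction for the toy 1D-to-2D lifting problem.

    The observed coordinate x is compatible with two points on the unit
    circle, (x, +y) and (x, -y). The MSE minimizer is their weighted mean.
    p_up is the (assumed) probability of the +y mode.
    """
    y = np.sqrt(1.0 - x**2)
    modes = np.array([[x, y], [x, -y]])      # both lie on the unit circle
    weights = np.array([p_up, 1.0 - p_up])
    return weights @ modes, modes

x = 0.6
pred, modes = conditional_mean_lift(x)
# The single-hypothesis optimum has norm |x| < 1: it is off the circle.
print("single prediction:", pred, "segment length:", np.linalg.norm(pred))
# Each of the two hypotheses keeps the segment length equal to 1.
print("hypothesis lengths:", np.linalg.norm(modes, axis=1))
```

With equiprobable modes the MSE-optimal prediction collapses onto the $x$ axis, so the predicted segment is shorter than the true one. Keeping both modes as separate hypotheses preserves the segment length, which is the trade-off the figure illustrates.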

Theorems & Definitions (15)

  • Proposition 4.1: Human pose manifold
  • Proposition 4.2: Inconsistency of MSE minimizer
  • Definition A.1: Human skeleton
  • Definition A.2: Human pose and movement
  • Proof of Proposition 4.1
  • Proof of Proposition 4.2
  • Corollary B.1
  • proof
  • Corollary B.2
  • proof
  • ...and 5 more