Platypose: Calibrated Zero-Shot Multi-Hypothesis 3D Human Motion Estimation

Paweł A. Pierzchlewicz, Caio O. da Silva, R. James Cotton, Fabian H. Sinz

Abstract

Single-camera 3D pose estimation is an ill-posed problem due to inherent ambiguities from depth, occlusion or keypoint noise. Multi-hypothesis pose estimation accounts for this uncertainty by providing multiple 3D poses consistent with the 2D measurements. Current research has predominantly concentrated on generating multiple hypotheses for single-frame static pose estimation or on single-hypothesis motion estimation. In this study, we focus on the new task of multi-hypothesis motion estimation. Multi-hypothesis motion estimation is not simply multi-hypothesis pose estimation applied to multiple frames, which would ignore temporal correlation across frames. Instead, it requires distributions that can generate temporally consistent samples, which is significantly more challenging than multi-hypothesis pose estimation or single-hypothesis motion estimation. To this end, we introduce Platypose, a framework that uses a diffusion model pretrained on 3D human motion sequences for zero-shot 3D pose sequence estimation. Platypose outperforms baseline methods on multi-hypothesis motion estimation. Platypose also achieves state-of-the-art calibration and competitive joint error when tested on static poses from Human3.6M, MPI-INF-3DHP and 3DPW. Finally, because it is zero-shot, our method generalizes flexibly to different settings such as multi-camera inference.

Paper Structure (44 sections, 9 equations, 8 figures, 13 tables, 2 algorithms)

Figures (8)

  • Figure 1: Example samples from the posterior. Platypose generates samples with smooth motion. Darker color indicates later frames in time. Trajectories of wrists and feet are shown for each frame. Top 5 samples are shown at frames 64, 96, 128, 192, 224, 255. The camera icon indicates the direction from which the 2D observations are obtained; the view therefore shows the depth axis, where increased variance is expected.
  • Figure 2: Simplified sequence estimation problem. A) Mean and standard deviation of a Gaussian process fit to a sine function. B) Result of strategy 1 -- choosing the best sample in each frame -- which is the same for both shuffled and non-shuffled sequences. C) Result of strategy 2 -- best sequence fit as a whole -- for the shuffled sequences. D) Result of strategy 2 -- best sequence fit as a whole -- for the sequences sampled from the Gaussian process. Dotted lines are the samples from the Gaussian process, the solid line is the selected sequence, and the dashed line is the ground truth sine wave.
  • Figure 3: Schematic of sampling using Platypose -- A noisy 3D motion ${\bm{x}}$ is denoised by a motion diffusion model trained on H36M. The denoised 3D motion samples $\hat{{\bm{x}}}_0$ are projected to 2D with a camera model. The reprojection error between the projections and 2D observations is minimized. The updated 3D motion is diffused to $t - n$ and passed back into the diffusion model.
  • Figure 4: Examples of 3D motion estimates for Human3.6M. Darker color indicates later frames in time. Trajectories of wrists and feet are shown for each frame. Orange poses represent the best sampled hypothesis out of 200 samples, black poses are the ground truth 3D poses.
  • Figure 5: A) Impact of the number of diffusion steps on minMPJPE and the inference time. Evaluated for single frame estimation. Mean and standard deviation are plotted from 3 seeds. B) Impact of the number of samples on minMPJPE for two different sequence lengths.
  • ...and 3 more figures
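
The toy problem described in the Figure 2 caption can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: it assumes the Gaussian process mean equals the ground truth sine wave, uses an RBF kernel with arbitrarily chosen hyperparameters, and measures error as mean absolute deviation. All function names are hypothetical.

```python
import numpy as np


def rbf_kernel(t, length_scale=3.0, variance=0.05):
    """Squared-exponential covariance between all pairs of time points."""
    diff = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def sample_hypotheses(rng, truth, t, n_samples):
    """Draw temporally correlated sequences around the ground truth.

    Assumption: the GP mean equals the ground truth and the kernel is RBF.
    Returns an array of shape (n_samples, n_frames).
    """
    cov = rbf_kernel(t) + 1e-6 * np.eye(len(t))  # jitter for numerical stability
    return rng.multivariate_normal(truth, cov, size=n_samples)


def shuffle_within_frames(rng, samples):
    """Permute hypotheses independently in every frame.

    Per-frame marginals are unchanged, but temporal correlation is destroyed.
    """
    shuffled = samples.copy()
    for frame in range(samples.shape[1]):
        shuffled[:, frame] = rng.permutation(samples[:, frame])
    return shuffled


def per_frame_best_error(samples, truth):
    """Strategy 1: pick the closest hypothesis separately in each frame."""
    return np.abs(samples - truth).min(axis=0).mean()


def best_sequence_error(samples, truth):
    """Strategy 2: pick the single sequence with the lowest mean error."""
    return np.abs(samples - truth).mean(axis=1).min()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 2.0 * np.pi, 100)
    truth = np.sin(t)
    correlated = sample_hypotheses(rng, truth, t, n_samples=200)
    shuffled = shuffle_within_frames(rng, correlated)
    for name, s in [("correlated", correlated), ("shuffled", shuffled)]:
        print(name, per_frame_best_error(s, truth), best_sequence_error(s, truth))
```

Strategy 1 returns the same error for the shuffled and non-shuffled sets by construction, because each frame contains the same values in a different order. Strategy 2 can never score below strategy 1, and in this setup it is expected to degrade noticeably on the shuffled set, which is the caption's point: a per-frame metric cannot tell temporally consistent hypotheses from inconsistent ones.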