DPoser: Diffusion Model as Robust 3D Human Pose Prior

Junzhe Lu, Jing Lin, Hongkun Dou, Ailing Zeng, Yue Deng, Yulun Zhang, Haoqian Wang

TL;DR

DPoser introduces an unconditional diffusion-based 3D human pose prior that is trained on SMPL pose representations and used as a regularizer within inverse-problem formulations of pose-related tasks. By combining variational diffusion sampling with test-time truncated timestep scheduling tailored to pose data, it achieves consistent improvements over state-of-the-art priors across human mesh recovery, pose completion, motion denoising, and pose generation. Key contributions include the unconditional diffusion prior, the test-time truncation strategy, and a comprehensive set of experiments and ablations that validate robustness and generalization. The prior supports flexible, optimization-driven pose estimation pipelines with improved realism and diversity, for both single images and sequences.

Abstract

This work aims to construct a robust human pose prior, which remains a persistent challenge because of biomechanical constraints and the diversity of human movement. Traditional priors such as VAEs and NDFs often fall short in realism and generalization, notably on unseen noisy poses. To address these issues, we introduce DPoser, a robust and versatile human pose prior built upon diffusion models. DPoser treats various pose-centric tasks as inverse problems and solves them efficiently with variational diffusion sampling. Because it is designed to plug into optimization frameworks, DPoser benefits human mesh recovery, pose generation, pose completion, and motion denoising. Furthermore, because articulated poses differ in nature from structured images, we propose truncated timestep scheduling to make DPoser more effective. Compared with the uniform scheduling commonly used in image domains, this approach yields improvements of 5.4%, 17.2%, and 3.8% on human mesh recovery, pose completion, and motion denoising, respectively. Comprehensive experiments demonstrate the superiority of DPoser over existing state-of-the-art pose priors across multiple tasks.

Paper Structure (28 sections, 20 equations, 15 figures, 13 tables, 1 algorithm)

Figures (15)

  • Figure 1: An overview of DPoser's versatility and performance across multiple pose-related tasks. Built on diffusion models, DPoser serves as a robust and adaptable pose prior. Shown are scenarios in (a) pose generation, (b) human mesh recovery, (c) motion denoising, and (d) pose completion. DPoser consistently outperforms existing priors such as VPoser (Pavlakos et al., 2019) in performance benchmarks.
  • Figure 2: Overview of the DPoser methodology. Panel (a) presents three tasks: human mesh recovery, pose completion, and motion denoising; some components, such as camera optimization, are omitted for clarity. Panel (b) shows the DPoser regularization process, which adds noise to the current pose and applies a one-step denoiser to obtain a denoised pose. $L_\text{DPoser}$ is computed between the denoised pose and the current pose. Panel (c) outlines the optimization process from initial to fitted poses via loss minimization.
  • Figure 3: Illustration of the rationale behind our proposed truncated timestep scheduling. We employ the deterministic DDIM sampler (Song et al., 2020) with a limited number of steps and assess the quality of generated poses using the Self-Intersection percentage (SI).
  • Figure 4: Qualitative comparison of generated human poses: (b) illustrates naturalistic poses aligned with real-world data, whereas (c) shows poses that, despite superior metrics, lack a natural appearance. *We use a DDIM sampler (Song et al., 2020) with only 10 steps.
  • Figure 5: Human mesh recovery. (a) Fitting from scratch. *Ground truth for the EHF dataset is annotated in SMPL-X (Pavlakos et al., 2019), which extends SMPL (Loper et al., 2015) with fully articulated hands and an expressive face. (b) Initialization using the CLIFF (Li et al., 2022) prediction.
  • ...and 10 more figures
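
The regularization step described in the Figure 2 caption (add noise to the current pose, apply a one-step denoiser, compare the denoised pose with the current pose) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine noise schedule, the function names (`alpha_sigma`, `truncated_timestep`, `toy_noise_predictor`, `dposer_style_loss`, `fit`), the timestep range 0.2 to 0.05, the linear annealing, and the unit loss weight are all hypothetical. The toy predictor stands in for a trained network and is exact only when poses follow a standard normal distribution; a real pose would be an SMPL parameter vector rather than a short list of floats.

```python
import math
import random


def alpha_sigma(t):
    """Assumed variance-preserving schedule (alpha^2 + sigma^2 = 1), t in (0, 1)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def truncated_timestep(step, total_steps, t_max=0.2, t_min=0.05):
    """Anneal t linearly from t_max down to t_min over the optimization.

    The endpoints and the linear shape are illustrative assumptions.
    """
    frac = step / max(total_steps - 1, 1)
    return t_max + (t_min - t_max) * frac


def toy_noise_predictor(x_t, t):
    """Stand-in for a trained network: the exact estimate E[eps | x_t]
    when poses are distributed as N(0, I) under the schedule above."""
    _, sigma = alpha_sigma(t)
    return [sigma * v for v in x_t]


def dposer_style_loss(pose, t, noise, predictor=toy_noise_predictor):
    """Noise the current pose, denoise it in one step, and compare.

    Returns the squared distance between the current pose and the denoised
    pose, plus the denoised pose itself (treated as a constant target).
    """
    alpha, sigma = alpha_sigma(t)
    x_t = [alpha * p + sigma * n for p, n in zip(pose, noise)]
    eps_hat = predictor(x_t, t)
    denoised = [(x - sigma * e) / alpha for x, e in zip(x_t, eps_hat)]
    loss = sum((p - d) ** 2 for p, d in zip(pose, denoised))
    return loss, denoised


def fit(pose, total_steps=50, lr=0.25, seed=0):
    """Gradient descent on the regularizer alone (no data term), for illustration."""
    rng = random.Random(seed)
    pose = list(pose)
    for step in range(total_steps):
        t = truncated_timestep(step, total_steps)
        noise = [rng.gauss(0.0, 1.0) for _ in pose]
        _, denoised = dposer_style_loss(pose, t, noise)
        # Gradient with the denoised pose held fixed: 2 * (pose - denoised).
        pose = [p - lr * 2.0 * (p - d) for p, d in zip(pose, denoised)]
    return pose
```

In a full pipeline this term would be added to a task-specific data term (for example a 2D keypoint reprojection error in human mesh recovery), so that the prior only nudges the fitted pose toward plausible configurations instead of driving the optimization by itself as `fit` does here.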