Diffusion-based Pose Refinement and Multi-hypothesis Generation for 3D Human Pose Estimation

Hongbo Kang, Yong Wang, Mengyuan Liu, Doudou Wu, Peng Liu, Xinlin Yuan, Wenming Yang

TL;DR

DRPose tackles the instability of probabilistic 3D HPE by introducing a diffusion-based refinement that starts from a deterministic pose and refines it through reverse diffusion, producing well-aligned multi-hypothesis outputs. The framework relies on a Scalable Graph Convolution Transformer to denoise and learn latent 3D pose features, and a Pose Refinement Module that balances certain and uncertain components to yield a refined pose. By generating multiple hypotheses via different diffusion noises and iterating, then aggregating them, DRPose achieves state-of-the-art accuracy on both single- and multi-hypothesis 3D pose estimation benchmarks, notably on Human3.6M and MPI-INF-3DHP. The approach offers practical improvements in robustness and accuracy for 3D human pose estimation in real-world settings where 2D detector uncertainty and depth ambiguity are prevalent.

Abstract

Previous probabilistic models for 3D Human Pose Estimation (3DHPE) aimed to enhance pose accuracy by generating multiple hypotheses. However, most of the hypotheses generated deviate substantially from the true pose. Compared to deterministic models, the excessive uncertainty in probabilistic models leads to weaker performance in single-hypothesis prediction. To address these two challenges, we propose a diffusion-based refinement framework called DRPose, which refines the output of deterministic models by reverse diffusion and achieves more suitable multi-hypothesis prediction for the current pose benchmark by multi-step refinement with multiple noises. To this end, we propose a Scalable Graph Convolution Transformer (SGCT) and a Pose Refinement Module (PRM) for denoising and refining. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets demonstrate that our method achieves state-of-the-art performance on both single and multi-hypothesis 3DHPE. Code is available at https://github.com/KHB1698/DRPose.
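
The multi-hypothesis procedure described above (one deterministic initial pose, several noise samples, repeated refinement, then aggregation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_refiner` is a hypothetical stand-in for the trained SGCT + PRM network, the refined pose is simply fed back in at each step rather than re-noised according to a diffusion schedule, and averaging is only one possible aggregation rule.

```python
import random
from typing import List, Tuple

Pose = List[Tuple[float, ...]]  # J joints, each an (x, y, z) tuple


def toy_refiner(initial: Pose, noisy: Pose, weight: float = 0.1) -> Pose:
    """Hypothetical stand-in for the learned refinement model (SGCT + PRM).

    It blends the noisy pose toward the deterministic estimate. The real
    model is a trained network conditioned on the 2D pose and the timestep.
    """
    return [
        tuple((1.0 - weight) * a + weight * b for a, b in zip(j_init, j_noisy))
        for j_init, j_noisy in zip(initial, noisy)
    ]


def generate_hypotheses(
    initial: Pose, num_hypotheses: int, num_steps: int, rng: random.Random
) -> List[Pose]:
    """Refine the same initial pose from several independent noise samples."""
    hypotheses = []
    for _ in range(num_hypotheses):
        # Each hypothesis starts from its own pure Gaussian noise sample.
        current = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in initial]
        for _ in range(num_steps):
            # Simplification: feed the refined pose straight back in.
            current = toy_refiner(initial, current)
        hypotheses.append(current)
    return hypotheses


def aggregate_mean(hypotheses: List[Pose]) -> Pose:
    """Average the hypotheses joint by joint (an assumed aggregation rule)."""
    n = len(hypotheses)
    num_joints = len(hypotheses[0])
    return [
        tuple(sum(h[j][k] for h in hypotheses) / n for k in range(3))
        for j in range(num_joints)
    ]
```

Calling `generate_hypotheses(initial_pose, 5, 2, random.Random(0))` and passing the result to `aggregate_mean` yields a single pose. With `num_hypotheses=1` and `num_steps=1`, the same code corresponds to the single-hypothesis setting.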

Paper Structure

This paper contains 12 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the DRPose framework in the inference stage for pose refinement. Top: Single-hypothesis inference. The initial 3D pose is combined with a single pure-noise sample and refined once to obtain the refined 3D pose. Bottom: Multi-hypothesis inference. The initial 3D pose is combined with multiple pure-noise samples and refined multiple times to obtain multiple refined 3D poses. In real-world applications, the final 3D pose is obtained by aggregating the multiple hypotheses.
  • Figure 2: Overview of the DRPose framework in the training stage. In the forward process, the ground truth is diffused to obtain a noisy 3D pose, which is combined with the initial 3D pose from the initial predictor to form the input of the reverse process. This input is then denoised and refined, conditioned on the 2D pose and the timestep, to produce the refined 3D pose.
  • Figure 3: Overview of the Refinement Model, which consists of the Scalable Graph Convolution Transformer (SGCT) and Pose Refinement Module (PRM).
  • Figure 4: Qualitative results of our method on the H3.6M test set. The left shows the initial 3D pose obtained by DC-GCT [kang2023double], and the right shows the single and multiple refined 3D poses obtained by our DRPose. The blue pose represents the ground truth.
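
The training-stage input described in the Figure 2 caption can be illustrated with the standard DDPM forward process, q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I). The sketch below is a hedged example rather than the paper's code: the linear beta schedule, its endpoints, and the per-joint concatenation used to "combine" the noisy and initial poses are all assumptions, and the function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Pose = List[Tuple[float, ...]]  # J joints, each an (x, y, z) tuple


def linear_alpha_bar(
    num_timesteps: int, beta_start: float = 1e-4, beta_end: float = 2e-2
) -> List[float]:
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_timesteps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_timesteps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse(gt_pose: Pose, t: int, alpha_bars: List[float], rng: random.Random) -> Pose:
    """Sample x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in joint)
        for joint in gt_pose
    ]


def build_training_input(
    gt_pose: Pose, initial_pose: Pose, t: int,
    alpha_bars: List[float], rng: random.Random,
) -> Pose:
    """Pair the diffused ground truth with the initial predictor's output.

    Assumption: "combined" is modeled as per-joint concatenation, giving
    six values per joint (noisy xyz followed by initial xyz).
    """
    noisy = diffuse(gt_pose, t, alpha_bars, rng)
    return [n + i for n, i in zip(noisy, initial_pose)]
```

At small `t` the noisy pose stays close to the ground truth, while at large `t` it approaches pure noise, which matches the inference setting in Figure 1 where refinement starts from pure noise plus the initial pose.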