D3PRefiner: A Diffusion-based Denoise Method for 3D Human Pose Refinement

Danqi Yan, Qing Gao, Yuepeng Qian, Xinxing Chen, Chenglong Fu, Yuquan Leng

TL;DR

This paper tackles depth-ambiguous monocular 3D pose estimation by introducing D3PRefiner, a diffusion-based refinement framework that denoises outputs of any seq2seq 3D pose estimator using 2D pose sequences as conditioning. It models the forward noise as a conditional Gaussian distribution and trains a PoseDenoiser to perform reverse diffusion, enabling efficient refinement via DDIM acceleration over $K$ steps. The approach yields consistent accuracy gains on Human3.6M, reducing MPJPE and P-MPJPE by around $10$–$12\%$ across multiple baselines and sequence lengths, with larger gains when using ground-truth 2D poses. This diffusion-based refinement offers a practical path to markedly improve monocular 3D pose reliability and coherence, and opens avenues for applying conditional diffusion to other sequence-to-sequence estimation tasks.

Abstract

Three-dimensional (3D) human pose estimation using a monocular camera has gained increasing attention due to its ease of implementation and the abundance of data available from daily life. However, owing to the inherent depth ambiguity in images, the accuracy of existing monocular camera-based 3D pose estimation methods remains unsatisfactory, and the estimated 3D poses usually contain considerable noise. By observing the histogram of this noise, we find that each dimension of the noise follows a certain distribution, which suggests that a neural network can learn the mapping between noisy poses and ground truth poses. In this work, to obtain more accurate 3D poses, we propose a Diffusion-based 3D Pose Refiner (D3PRefiner) that refines the output of any existing 3D pose estimator. We first introduce a conditional multivariate Gaussian distribution to model the distribution of noisy 3D poses, using paired 2D poses and noisy 3D poses as conditions to achieve greater accuracy. Additionally, we leverage the architecture of current diffusion models to convert the distribution of noisy 3D poses into ground truth 3D poses. To evaluate the effectiveness of the proposed method, two state-of-the-art sequence-to-sequence 3D pose estimators are used as base 3D pose estimation models, and the proposed method is evaluated on different types of 2D poses and different input sequence lengths. Experimental results demonstrate that the proposed architecture can significantly improve the performance of current sequence-to-sequence 3D pose estimators, with a reduction of at least 10.3% in the mean per joint position error (MPJPE) and at least 11.0% in the Procrustes MPJPE (P-MPJPE).
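
The two metrics quoted above can be made concrete with a short sketch. The NumPy code below assumes the usual definitions: MPJPE as the mean Euclidean distance between predicted and ground-truth joints, and P-MPJPE as the same error after a per-frame similarity (Procrustes) alignment. The function names, array shapes, and the 17-joint toy data are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between matching joints; arrays are (..., J, 3)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def procrustes_align(pred, gt):
    """Align one pose pred (J, 3) to gt (J, 3) with the best scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance gives the optimal rotation (row-vector convention).
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R + mu_g

def p_mpjpe(pred, gt):
    """MPJPE after aligning every frame of pred (T, J, 3) to gt (T, J, 3)."""
    aligned = np.stack([procrustes_align(p, g) for p, g in zip(pred, gt)])
    return mpjpe(aligned, gt)

# Toy data: 4 frames, 17 joints (sizes chosen for illustration only).
rng = np.random.default_rng(0)
gt = rng.normal(size=(4, 17, 3))

# A pure translation by (3, 4, 0) moves every joint by exactly 5 units.
offset = gt + np.array([3.0, 4.0, 0.0])

# A scaled, rotated, and shifted copy of the ground truth.
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
similar = 2.0 * gt @ Rz + np.array([1.0, -2.0, 0.5])

print("MPJPE (offset):   ", mpjpe(offset, gt))
print("P-MPJPE (offset): ", p_mpjpe(offset, gt))
print("MPJPE (similar):  ", mpjpe(similar, gt))
print("P-MPJPE (similar):", p_mpjpe(similar, gt))
```

Because the alignment removes global translation, rotation, and scale before the error is measured, P-MPJPE reflects only the remaining error in the pose itself, whereas MPJPE also penalizes global misplacement.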

Paper Structure (21 sections, 9 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Overview framework of the proposed 3D human pose refinement method. A 3D human pose estimator takes a sequence of 2D human poses as input and outputs a 3D pose sequence with noise. The proposed method uses the 2D pose sequence as a condition to refine the noisy 3D pose sequence in $K$ steps.
  • Figure 2: Comparison of noise sampling schemes in the diffusion process. Solid black arrows denote the forward process in diffusion models, while dashed black arrows denote the reverse process. (a) visualizes the process of sampling noise from a standard Gaussian distribution $\mathcal{N}(0, \boldsymbol{I})$. The forward process gradually destroys the structure of the ground truth 3D pose until it conforms to $\mathcal{N}(0, \boldsymbol{I})$. (b) visualizes the process of sampling noise from a conditional noisy 3D pose distribution. It progressively transforms the ground truth 3D pose into a reasonable distribution of noisy 3D poses.
  • Figure 3: Overview of the proposed D3PRefiner. (a) shows the training pipeline. In the forward process, noise is sampled from the proposed conditional noisy 3D pose distribution and added to $\boldsymbol{x}_{0}$ at timestep $t$; in the reverse process, the PoseDenoiser is trained to predict the error $\boldsymbol{e}$ in the resulting noisy sample. (b) shows the inference pipeline. The output of a 3D pose estimator is taken as $\boldsymbol{x}_{T}$ and fed into the PoseDenoiser together with a 2D pose sequence $\boldsymbol{y}$. Over $K$ denoising steps, the noise in $\boldsymbol{x}_{T}$ is gradually removed, yielding a denoised 3D pose sequence.
  • Figure 4: MPJPE distribution comparison on Human3.6M.
  • Figure 5: Qualitative results. Red, green, and blue skeletons represent the ground truth 3D pose, the estimated 3D pose, and the 3D pose refined by the proposed D3PRefiner, respectively.
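
The training and inference pipelines described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear beta schedule, the closed-form blend in `forward_noise`, the deterministic DDIM-style update (eta = 0), the per-dimension Gaussian used to sample $\boldsymbol{e}$, and the oracle `denoiser` that stands in for a trained PoseDenoiser are all hypothetical choices made so the example runs on its own.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, e, alpha_bar, t):
    """Forward process: blend the clean pose x0 with the sampled e at timestep t."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * e

def ddim_refine(x_T, y, denoiser, alpha_bar, K=10):
    """Deterministic DDIM-style reverse process over K evenly spaced timesteps."""
    T = len(alpha_bar)
    steps = np.linspace(T - 1, 0, K).round().astype(int)
    x = x_T
    for i, t in enumerate(steps):
        e_hat = denoiser(x, y, t)  # predicted error at this timestep
        # Estimate of the clean pose implied by the current sample and e_hat.
        x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * e_hat) / np.sqrt(alpha_bar[t])
        # Move to the next (smaller) timestep; the final step returns x0_hat itself.
        ab_prev = alpha_bar[steps[i + 1]] if i + 1 < K else 1.0
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * e_hat
    return x

rng = np.random.default_rng(0)
frames, joints = 9, 17                      # illustrative sizes only
x0 = rng.normal(size=(frames, joints, 3))   # stand-in ground truth 3D pose sequence
y = x0[..., :2]                             # stand-in 2D pose sequence (toy projection)

# Hypothetical conditional Gaussian: a biased, jittered copy of the pose.
e = (x0 + 0.05) + 0.02 * rng.normal(size=x0.shape)

alpha_bar = make_alpha_bar()
x_T = forward_noise(x0, e, alpha_bar, len(alpha_bar) - 1)

def oracle_denoiser(x_t, y, t):
    """Stand-in for a trained PoseDenoiser: returns the true e and ignores its inputs."""
    return e

refined = ddim_refine(x_T, y, oracle_denoiser, alpha_bar, K=10)
print("max |x_T - x0|:    ", np.abs(x_T - x0).max())
print("max |refined - x0|:", np.abs(refined - x0).max())
```

With the oracle, the loop recovers the clean sequence up to floating-point error, which only checks the mechanics of the update. A trained network in its place would be called $K$ times, once per denoising step, rather than once per training timestep.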