$\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation

Weiquan Wang, Jun Xiao, Chunping Wang, Wei Liu, Zhao Wang, Long Chen

TL;DR

This work tackles occluded monocular 3D human pose estimation by combining a discrete pose representation with diffusion modeling. It introduces $\text{Di}^2\text{Pose}$, a two-stage framework that first quantizes 3D poses into discrete tokens via a VQ-VAE-style encoder/decoder with Local-MLP blocks, and then models those tokens with a conditional discrete diffusion process in latent space, guided by a 2D image. The forward diffusion simulates occlusion via an Occ token and token replacement, while the reverse diffusion uses a transformer-based denoiser to recover poses, with a memory-efficient transition schedule. Empirically, $\text{Di}^2\text{Pose}$ achieves state-of-the-art performance on Human3.6M, 3DPW, and 3DPW-Occ, particularly under occlusion, and ablations confirm that local joint modeling, occlusion-aware transitions, and more diffusion steps each improve accuracy and robustness.

Abstract

Continuous diffusion models have demonstrated their effectiveness in addressing the inherent uncertainty and indeterminacy in monocular 3D human pose estimation (HPE). Despite their strengths, the need for large search spaces and the corresponding demand for substantial training data make these models prone to generating biomechanically unrealistic poses. This challenge is particularly noticeable in occlusion scenarios, where inferring 3D structure from 2D images becomes more complex. In response to these limitations, we introduce Discrete Diffusion Pose ($\text{Di}^2\text{Pose}$), a novel framework designed for occluded 3D HPE that capitalizes on the benefits of a discrete diffusion model. Specifically, $\text{Di}^2\text{Pose}$ employs a two-stage process: it first converts 3D poses into a discrete representation through a pose quantization step, and this representation is subsequently modeled in latent space through a discrete diffusion process. This design confines the search space to physically viable configurations and enhances the model's ability to capture how occlusions affect human pose within the latent space. Extensive evaluations on various benchmarks (e.g., Human3.6M, 3DPW, and 3DPW-Occ) demonstrate its effectiveness.
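
The pose quantization step described above can be illustrated with a generic nearest-neighbour codebook lookup, as used in VQ-VAE-style models. This is only a minimal sketch under stated assumptions: the function names (`quantize`, `dequantize`), the toy 2-D codebook, and the input features are all hypothetical, and the paper's actual encoder (with Local-MLP blocks), codebook size, and feature dimension are not reproduced here.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: `features` stands in for the output of a pose encoder;
    here it is just a list of small toy vectors.
    """
    tokens = []
    for f in features:
        distances = [squared_distance(f, c) for c in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook entry for each token (input to a pose decoder)."""
    return [list(codebook[k]) for k in tokens]


# Toy example: a hypothetical codebook with 4 entries of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(features, codebook)
recovered = dequantize(tokens, codebook)
print(tokens)
```

Because every token can only take one of the codebook indices, any sequence of tokens decodes to a combination of learned entries, which is the sense in which a discrete representation narrows the search space.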

Paper Structure (23 sections, 23 equations, 6 figures, 6 tables, 2 algorithms)

This paper contains 23 sections, 23 equations, 6 figures, 6 tables, 2 algorithms.

Figures (6)

  • Figure 1: (a) Results of DiffPose [gong2023diffpose] and $\text{Di}^2\text{Pose}$ on the Human3.6M dataset [von2018pw3d] (MPJPE metric), across varying proportions of training samples. (b) Predictions of the two methods under occlusion.
  • Figure 2: Overview of our two-stage $\text{Di}^2\text{Pose}$ framework. In stage 1, we train a pose quantization step that transforms a 3D pose $\mathbf{P}$ into multiple discrete tokens $\mathbf{k}$, each token being an index into the codebook $\mathcal{C}$. In stage 2, we model $\mathbf{k}$ in the discrete space with a discrete diffusion process. In the forward process, each token is probabilistically occluded with the Occ token or replaced with another available token. In the reverse process, the model uses an independent image encoder and a pose denoiser to reconstruct all the tokens, conditioned on the 2D image. The reconstructed tokens are finally decoded by the pose decoder, yielding the recovered 3D pose. Notably, only the parameters of the pose denoiser are updated; the pose decoder and image encoder are frozen.
  • Figure 3: (a) depicts the structure of the Local-MLP block; (b) shows the Joint Shift operation, where the arrows indicate the steps, and different subscript numbers represent the features of different joints. The gray blocks indicate zero padding.
  • Figure 4: Qualitative results on two datasets. Joints on the right side are marked in green, while other joints are highlighted in blue.
  • Figure 5: Failure cases of our $\text{Di}^2\text{Pose}$ for 3D HPE. These instances primarily occur in scenarios with severe occlusions, as compared against ground truth (GT) poses. The content encircled by the dashed line indicates the parts where differences exist.
  • ...and 1 more figures
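
The forward process in the Figure 2 caption (each token is either occluded with the Occ token or replaced with another token) can be sketched as a simple per-token corruption step. This is an illustrative sketch, not the paper's transition schedule: the probabilities `p_occ` and `p_replace`, the choice to place the Occ token at index `codebook_size`, the uniform resampling, and the assumption that an occluded token stays occluded are all hypothetical choices made for the example.

```python
import random
from typing import List, Sequence


def forward_step(tokens: Sequence[int], codebook_size: int,
                 p_occ: float, p_replace: float,
                 rng: random.Random) -> List[int]:
    """Apply one corruption step to a sequence of pose tokens."""
    occ = codebook_size  # hypothetical: Occ token sits after the real indices
    out = []
    for k in tokens:
        if k == occ:
            # Assumption: once occluded, a token stays occluded.
            out.append(occ)
            continue
        u = rng.random()
        if u < p_occ:
            out.append(occ)  # occlude the token
        elif u < p_occ + p_replace:
            # Uniform resample over the codebook (may return the same index).
            out.append(rng.randrange(codebook_size))
        else:
            out.append(k)  # keep the token unchanged
    return out


def corrupt(tokens: Sequence[int], codebook_size: int, steps: int,
            p_occ: float, p_replace: float, seed: int = 0) -> List[int]:
    """Run `steps` forward corruption steps with fixed probabilities."""
    rng = random.Random(seed)
    current = list(tokens)
    for _ in range(steps):
        current = forward_step(current, codebook_size, p_occ, p_replace, rng)
    return current


# Toy example: 5 pose tokens drawn from a hypothetical codebook of size 8.
clean = [3, 1, 7, 0, 5]
noisy = corrupt(clean, codebook_size=8, steps=10, p_occ=0.1, p_replace=0.05)
print(noisy)
```

In this sketch the fraction of Occ tokens grows with the number of steps, so the denoiser is trained to recover tokens from sequences that range from lightly to heavily occluded.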