$\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation
Weiquan Wang, Jun Xiao, Chunping Wang, Wei Liu, Zhao Wang, Long Chen
TL;DR
This work tackles occluded monocular 3D human pose estimation by combining a discrete pose representation with diffusion modeling. It introduces $\text{Di}^2\text{Pose}$, a two-stage framework that first quantizes 3D poses into discrete tokens via a VQ-VAE-style encoder/decoder with Local-MLP blocks, and then models those tokens with a conditional discrete diffusion process in latent space, guided by a 2D image. The forward diffusion simulates occlusion via an Occ token and token replacement, while the reverse diffusion uses a transformer-based denoiser to recover poses, with a memory-efficient transition schedule. Empirically, $\text{Di}^2\text{Pose}$ achieves state-of-the-art performance on Human3.6M, 3DPW, and 3DPW-Occ, and is particularly strong under occlusion. Ablations confirm the benefits of local joint modeling, occlusion-aware transitions, and more diffusion steps for accuracy and robustness.
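To make the forward process concrete, the following is a minimal sketch of one occlusion-style corruption step over pose tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed per-step probabilities `p_occ` and `p_replace`, and the choice of reserving the id `codebook_size` for the Occ token are all hypothetical, and the actual transition schedule is not specified in this summary.

```python
import random


def forward_step(tokens, codebook_size, p_occ, p_replace, rng):
    """Apply one illustrative forward-diffusion step to a list of token ids.

    Ids 0..codebook_size-1 are ordinary pose tokens. The id `codebook_size`
    is reserved for the Occ token (an assumption made for this sketch).
    """
    occ_id = codebook_size
    out = []
    for tok in tokens:
        if tok == occ_id:
            # Assumed absorbing state: an occluded token stays occluded.
            out.append(occ_id)
            continue
        u = rng.random()
        if u < p_occ:
            # Simulate occlusion of this token.
            out.append(occ_id)
        elif u < p_occ + p_replace:
            # Replace with a uniformly drawn pose token (may equal the original).
            out.append(rng.randrange(codebook_size))
        else:
            # Keep the token unchanged.
            out.append(tok)
    return out


def corrupt(tokens, codebook_size, steps, p_occ=0.1, p_replace=0.05, seed=0):
    """Run `steps` forward steps with fixed (hypothetical) probabilities."""
    rng = random.Random(seed)
    for _ in range(steps):
        tokens = forward_step(tokens, codebook_size, p_occ, p_replace, rng)
    return tokens
```

Under these assumptions, running more steps drives a growing fraction of tokens to Occ. A reverse model would then be trained to recover the original tokens from such corrupted sequences, conditioned on the 2D image.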
Abstract
Continuous diffusion models have demonstrated their effectiveness in addressing the inherent uncertainty and indeterminacy in monocular 3D human pose estimation (HPE). Despite their strengths, the need for large search spaces and the corresponding demand for substantial training data make these models prone to generating biomechanically unrealistic poses. This challenge is particularly noticeable in occlusion scenarios, where inferring 3D structure from 2D images becomes more difficult. In response to these limitations, we introduce Discrete Diffusion Pose ($\text{Di}^2\text{Pose}$), a novel framework designed for occluded 3D HPE that capitalizes on the benefits of a discrete diffusion model. Specifically, $\text{Di}^2\text{Pose}$ employs a two-stage process: it first converts 3D poses into a discrete representation through a \emph{pose quantization step}, and this representation is subsequently modeled in latent space through a \emph{discrete diffusion process}. This design confines the search space to physically viable configurations and enhances the model's ability to capture how occlusions affect human pose within the latent space. Extensive evaluations on various benchmarks (e.g., Human3.6M, 3DPW, and 3DPW-Occ) demonstrate its effectiveness.
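As an illustration of the pose quantization step, the sketch below maps continuous feature vectors to discrete token ids by nearest-neighbor lookup in a codebook, which is the standard VQ-VAE quantization rule. The function names, the toy two-dimensional codebook, and the use of squared Euclidean distance are assumptions for this example. The abstract does not specify the encoder, the codebook size, or the distance used.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Nearest is measured by squared Euclidean distance (an assumption).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def dequantize(tokens, codebook):
    """Look up the codebook vector for each token id."""
    return [codebook[t] for t in tokens]


# Toy usage with a hypothetical 3-entry codebook of 2-D vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
features = [[0.1, -0.1], [4.0, 6.0], [0.9, 1.2]]
tokens = quantize(features, codebook)
```

Because every token indexes a learned codebook entry, any sequence the diffusion model produces decodes to a combination of entries seen during training. This is one way to read the claim that the discrete representation confines the search space.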
