Cascaded Diffusion Framework for Probabilistic Coarse-to-Fine Hand Pose Estimation
Taeyun Woo, Jinah Park, Tae-Kyun Kim
TL;DR
This work introduces a probabilistic coarse-to-fine hand pose estimator by cascading diffusion models: a Stage 1 joint diffusion model samples diverse 3D hand joints from 2D cues, and a Stage 2 Mesh Latent Diffusion Model reconstructs the 3D hand mesh from a denoised latent conditioned on those joint samples and image features. By operating diffusion in a learned latent space and conditioning the mesh reconstruction on a distribution of plausible joints, the approach learns distribution-aware joint–mesh relationships and robust hand priors, improving performance under occlusion and articulation ambiguities. Experiments on FreiHAND and HO3Dv2 demonstrate state-of-the-art or competitive accuracy and strong best-of-N performance, while ablations confirm the benefits of latent-space diffusion, diverse joint conditioning, and the cascaded design for robustness. The method offers a practical, distribution-aware pipeline for accurate 3D hand mesh reconstruction, with potential extensions to multi-hand and hand–object interactions. Overall, the paper provides a principled framework for probabilistic, coarse-to-fine hand pose estimation with strong empirical validation.
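As a rough illustration of what "best-of-N" means for a probabilistic estimator, the sketch below scores N sampled joint hypotheses against the ground truth and keeps the one with the lowest error. The choice of mean per-joint position error as the metric, the 21-joint layout, and the function names are assumptions made for this example; the summary above does not specify the exact evaluation protocol.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: Euclidean distance averaged over joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)

def best_of_n(samples, gt):
    """Pick the hypothesis closest to the ground truth.

    samples: (N, J, 3) sampled joint hypotheses; gt: (J, 3) ground-truth joints.
    Returns (index of best hypothesis, its error).
    """
    errors = mpjpe(samples, gt[None])  # shape (N,)
    idx = int(np.argmin(errors))
    return idx, float(errors[idx])

# Toy data (assumption): 8 hypotheses over 21 joints, with one exact match planted.
rng = np.random.default_rng(0)
gt = rng.normal(size=(21, 3))
samples = gt[None] + rng.normal(scale=0.01, size=(8, 21, 3))
samples[3] = gt
best_idx, best_err = best_of_n(samples, gt)
```

Because the selection uses the ground truth, a best-of-N score is an evaluation-time measure of how well the sampled distribution covers the true pose, not something available at inference.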
Abstract
Deterministic models for 3D hand pose reconstruction, whether single-stage or cascaded, struggle with pose ambiguities caused by self-occlusions and complex hand articulations. Existing cascaded approaches refine predictions in a coarse-to-fine manner but remain deterministic and cannot capture pose uncertainties. Recent probabilistic methods model pose distributions yet are restricted to single-stage estimation, which often fails to produce accurate 3D reconstructions without refinement. To address these limitations, we propose a coarse-to-fine cascaded diffusion framework that combines probabilistic modeling with cascaded refinement. The first stage is a joint diffusion model that samples diverse 3D joint hypotheses, and the second stage is a Mesh Latent Diffusion Model (Mesh LDM) that reconstructs a 3D hand mesh conditioned on a joint sample. By training Mesh LDM with diverse joint hypotheses in a learned latent space, our framework learns distribution-aware joint-mesh relationships and robust hand priors. Furthermore, the cascaded design mitigates the difficulty of directly mapping 2D images to dense 3D poses, enhancing accuracy through sequential refinement. Experiments on FreiHAND and HO3Dv2 demonstrate that our method achieves state-of-the-art performance while effectively modeling pose distributions.
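The two-stage data flow described above can be sketched as a toy sampling loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the noise predictors are closed-form stand-ins for trained networks, the encoder and decoder are fixed random linear maps, image-feature conditioning is omitted, and the joint count, latent size, vertex count, and noise schedule are all arbitrary choices made for the example.

```python
import numpy as np

NUM_JOINTS, LATENT_DIM, NUM_VERTS = 21, 16, 64  # illustrative sizes (assumptions)
_w = np.random.default_rng(1)
W_ENC = _w.normal(size=(NUM_JOINTS * 3, LATENT_DIM))  # stand-in joint-to-latent map
W_DEC = _w.normal(size=(LATENT_DIM, NUM_VERTS * 3))   # stand-in latent-to-mesh decoder

def ddpm_sample(eps_model, cond, shape, rng, steps=50):
    """Standard DDPM ancestral sampling; eps_model(x_t, alpha_bar_t, cond) predicts noise."""
    betas = np.linspace(1e-4, 0.05, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)
    for t in reversed(range(steps)):
        eps = eps_model(x, alpha_bars[t], cond)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added on the final step
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)
    return x

def joint_eps(x_t, alpha_bar, joints_2d):
    """Stage 1 stand-in: xy is pinned by the 2D cue, depth stays ambiguous."""
    eps = np.empty_like(x_t)
    # xy: exact noise estimate when the clean value is known to be the 2D cue
    eps[..., :2] = (x_t[..., :2] - np.sqrt(alpha_bar) * joints_2d) / np.sqrt(1.0 - alpha_bar)
    # z: optimal noise estimate when the clean depth follows a unit Gaussian prior
    eps[..., 2] = np.sqrt(1.0 - alpha_bar) * x_t[..., 2]
    return eps

def latent_eps(z_t, alpha_bar, target_latent):
    """Stage 2 stand-in: the clean latent is a fixed function of the joint sample."""
    return (z_t - np.sqrt(alpha_bar) * target_latent) / np.sqrt(1.0 - alpha_bar)

def cascade(joints_2d, n_samples, rng):
    """Sample joint hypotheses, then denoise one mesh latent per hypothesis and decode it."""
    joints = ddpm_sample(joint_eps, joints_2d, (n_samples, NUM_JOINTS, 3), rng)
    cond = joints.reshape(n_samples, -1) @ W_ENC           # (N, LATENT_DIM)
    latents = ddpm_sample(latent_eps, cond, cond.shape, rng)
    meshes = (latents @ W_DEC).reshape(n_samples, NUM_VERTS, 3)
    return joints, meshes

rng = np.random.default_rng(0)
joints_2d = rng.uniform(-1.0, 1.0, size=(NUM_JOINTS, 2))
joints, meshes = cascade(joints_2d, n_samples=8, rng=rng)
```

In this toy setup the sampled joints agree with the 2D cue in xy but differ in depth, which mimics the kind of ambiguity the first stage is meant to represent; each mesh is then tied to one specific joint hypothesis rather than to a single deterministic estimate.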
