PAFUSE: Part-based Diffusion for 3D Whole-Body Pose Estimation
Nermin Samet, Cédric Rommel, David Picard, Eduardo Valle
TL;DR
PAFUSE tackles 3D whole-body pose estimation from monocular video by addressing scale and motion variance across body parts (body, face, hands) with a hierarchical, part-based approach. It introduces a diffusion-based, part-conditioned framework in which each body part is predicted in its own local frame anchored to a part root, enabling multi-hypothesis inference and joint optimization across parts. On the H3WB dataset, the method achieves state-of-the-art performance with an MPJPE of $41.4$ mm and improves over baselines, including those that use spatio-temporal cues and body mesh generation. The approach is modular, can be added to existing baselines, and is validated through ablations and qualitative in-the-wild results.
Abstract
We introduce a novel approach for 3D whole-body pose estimation that addresses the variance in scale and deformability across body parts, which arises when the 17 major joints of the human body are extended with fine-grained keypoints on the face and hands. In addition to addressing the challenge of exploiting motion in unevenly sampled data, we combine stable diffusion with a hierarchical part representation that predicts the relative locations of fine-grained keypoints within each part (e.g., face) with respect to the part's local reference frame. On the H3WB dataset, our method greatly outperforms the current state of the art, which fails to exploit temporal information. We also show considerable improvements over other spatiotemporal 3D human-pose estimation approaches that fail to account for body-part specificities. Code is available at https://github.com/valeoai/PAFUSE.
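To make the part-local representation concrete, the following is a minimal NumPy sketch, not the released PAFUSE code. It expresses the keypoints of each part relative to a root joint, inverts that mapping, and computes MPJPE. The 15-joint layout, the `PARTS` table, the choice of root joints, and all function names are hypothetical and chosen only for illustration. The diffusion model itself is not shown.

```python
import numpy as np

# Hypothetical toy layout (NOT the H3WB indexing): joints 0-4 are body joints,
# and each part maps to (root body joint index, indices of the part's keypoints).
PARTS = {
    "face": (0, np.arange(5, 9)),
    "left_hand": (3, np.arange(9, 12)),
    "right_hand": (4, np.arange(12, 15)),
}


def to_part_local(pose, parts=PARTS):
    """Express each part's keypoints relative to that part's root joint.

    pose: array of shape (..., J, 3). Body joints are left unchanged.
    """
    local = pose.copy()
    for root, idx in parts.values():
        local[..., idx, :] = pose[..., idx, :] - pose[..., [root], :]
    return local


def from_part_local(local, parts=PARTS):
    """Invert to_part_local by adding each part's root back to its keypoints."""
    pose = local.copy()
    for root, idx in parts.values():
        pose[..., idx, :] = local[..., idx, :] + local[..., [root], :]
    return pose


def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over all joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pose = rng.normal(size=(2, 15, 3))  # (frames, joints, xyz)
    local = to_part_local(pose)
    recovered = from_part_local(local)
    print("round-trip error:", mpjpe(recovered, pose))
```

In this sketch the part keypoints are only translated to the root. Any further per-part normalization, such as rescaling small parts like the hands, would be an additional design choice that the abstract does not specify.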
