PAFUSE: Part-based Diffusion for 3D Whole-Body Pose Estimation

Nermin Samet; Cédric Rommel; David Picard; Eduardo Valle

TL;DR

PAFUSE tackles 3D whole-body pose estimation from monocular video by addressing the differences in scale and motion across body parts (body, face, hands) with a hierarchical, part-based approach. It introduces a diffusion-based, part-conditioned framework in which each body part is predicted in its own local frame anchored to a part root, enabling multi-hypothesis inference and joint optimization across parts. On the H3WB dataset, the method achieves state-of-the-art performance with an MPJPE of 41.4 mm and improves on baselines, including those using spatio-temporal cues and body mesh generation. The approach is modular, extensible to existing baselines, and validated through ablations and qualitative in-the-wild results.
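For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints. The following is a minimal sketch using only the Python standard library. The function names and toy coordinates are illustrative, and picking the lowest-error candidate is only one possible way to score multiple hypotheses; this summary does not state which selection rule the paper uses.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, target: equal-length lists of (x, y, z) coordinates, e.g. in mm.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def best_hypothesis_mpjpe(hypotheses, target):
    """Lowest MPJPE among candidate poses (assumed oracle-style selection)."""
    return min(mpjpe(h, target) for h in hypotheses)


# Toy two-joint skeleton (illustrative values, in mm).
target = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0)]
hyp_a = [(3.0, 4.0, 0.0), (0.0, 100.0, 0.0)]   # joint errors: 5 mm and 0 mm
hyp_b = [(0.0, 0.0, 10.0), (0.0, 110.0, 0.0)]  # joint errors: 10 mm and 10 mm

print(mpjpe(hyp_a, target))                           # 2.5
print(best_hypothesis_mpjpe([hyp_a, hyp_b], target))  # 2.5
```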

Abstract

We introduce a novel approach for 3D whole-body pose estimation that addresses the variance in scale and deformability across body parts, which arises when extending the 17 major joints of the human body to fine-grained keypoints on the face and hands. In addition to addressing the challenge of exploiting motion in unevenly sampled data, we combine stable diffusion with a hierarchical part representation that predicts the relative locations of fine-grained keypoints within each part (e.g., face) with respect to the part's local reference frame. On the H3WB dataset, our method greatly outperforms the current state of the art, which fails to exploit temporal information. We also show considerable improvements over other spatiotemporal 3D human-pose estimation approaches that fail to account for body-part specificities. Code is available at https://github.com/valeoai/PAFUSE.
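The idea of predicting each part relative to its own local reference frame can be illustrated with a small sketch. Everything below is an assumption made for illustration: the toy 6-keypoint skeleton, the `PARTS` layout, and the choice of root keypoints are hypothetical and are not taken from the paper or its code. The sketch shifts a root-relative whole-body pose into per-part frames and then reassembles it hierarchically, body first.

```python
# Hypothetical part layout for a toy 6-keypoint skeleton. The real H3WB
# layout has 133 keypoints; the root choices here are assumptions.
PARTS = {
    "body": {"root": 0, "joints": [0, 1, 2]},
    "hand": {"root": 2, "joints": [3, 4]},  # assumed root: wrist (index 2)
    "face": {"root": 1, "joints": [5]},     # assumed root: head (index 1)
}


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def to_part_frames(pose):
    """Express each part's keypoints relative to that part's root keypoint."""
    return {
        name: [sub(pose[j], pose[spec["root"]]) for j in spec["joints"]]
        for name, spec in PARTS.items()
    }


def to_whole_body(parts_local, num_joints, origin=(0.0, 0.0, 0.0)):
    """Reassemble a whole-body pose: body first, then parts anchored on it."""
    pose = [None] * num_joints
    order = ["body"] + [name for name in PARTS if name != "body"]
    for name in order:
        spec = PARTS[name]
        # The body is anchored at the origin; other parts at their root joint,
        # which must already have been placed by the body pass.
        anchor = origin if name == "body" else pose[spec["root"]]
        for j, offset in zip(spec["joints"], parts_local[name]):
            pose[j] = add(anchor, offset)
    return pose


# Root-relative toy pose (body root at the origin), illustrative values in mm.
pose = [
    (0.0, 0.0, 0.0), (0.0, 50.0, 0.0), (30.0, 20.0, 0.0),  # body
    (32.0, 21.0, 1.0), (33.0, 19.0, 2.0),                  # hand
    (0.0, 55.0, 5.0),                                      # face
]
local = to_part_frames(pose)
print(local["hand"])  # [(2.0, 1.0, 1.0), (3.0, -1.0, 2.0)]
print(to_whole_body(local, num_joints=len(pose)) == pose)  # True
```

In the part frame, the hand offsets span only a few millimetres even though the hand sits tens of millimetres from the body root, which is the scale mismatch the part-based representation is meant to remove.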

Paper Structure (28 sections, 3 equations, 6 figures, 3 tables)

Introduction
Related Work
3D whole-body pose estimation
2D Whole-Body Pose Estimation
3D human pose estimation
Method
Overview
Whole-Body-Frame vs. Part-Frame Shift.
Generative-based prediction.
Losses.
Novelty.
Results
Experimental setup
Dataset.
Metrics.
...and 13 more sections

Figures (6)

Figure 1: In a whole-body skeleton, different keypoints have different scales and variations (top left) which presents a challenge for spatio-temporal prediction. Current approaches (top right) process all keypoints in a single network and, as such, have difficulties adapting to the different statistics of each body part. Our approach (bottom) groups keypoints by body parts that share similar behavior and processes them with dedicated networks, allowing better-adapted predictions.
Figure 2: (Left:) COCO-WholeBody [cocowholebody] layout used in the H3WB [h3wb] dataset, with 133 keypoints. In addition to the standard 17 main-body keypoints, there are 68 face keypoints, 42 hand keypoints (21 for each hand), and 6 foot keypoints (3 for each foot). (Right:) Example 2D and 3D whole-body pose pairs from the H3WB dataset. Images taken from [cocowholebody, h3wb].
Figure 3: Overall processing pipeline. During training, we split the input samples into body-part-specific tensors before performing the forward noising process (top-left). Then we train the part-based conditional denoising diffusion models (top-right). During inference (bottom), we start from random Gaussian noise and iterate the part-based conditional denoising diffusion models $K$ times to obtain the skeleton parts, which we reconstruct into a whole-body skeleton. For simplicity, we omit the temporal aspect, although our method actually processes video sequences of $N$ frames.
Figure 4: Distribution of gaps between annotated frames in the H3WB dataset, showing a long tail with many gaps of 100 or more frames. In contrast, Human3.6M is evenly annotated every 5 frames. Note the break in the y-axis, which makes room for the mode at 5 frames.
Figure 5: Qualitative results from the H3WB test set. Blue: ground truth; red: best hypothesis. Compared with D3DP, PAFUSE's prediction is better aligned to the body joints (e.g., the shoulders), because the hierarchical structure of the part-based prediction induces such alignment. Note also that PAFUSE's dedicated networks for hands and face lead to considerably better predictions for those body parts.
...and 1 more figures
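
The training half of the pipeline in Figure 3 (split into part tensors, then forward noising) can be sketched with the standard closed-form DDPM noising step, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. This is a sketch under stated assumptions, not the authors' implementation: the linear beta schedule, the shared timestep across parts, the function names, and the toy part tensors are all illustrative choices.

```python
import math
import random


def alpha_bar(t, betas):
    """Cumulative product of (1 - beta_s) for s = 0..t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod


def forward_noise(x0, t, betas, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar) * x_0, (1 - a_bar) * I)."""
    a_bar = alpha_bar(t, betas)
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [
        tuple(signal * c + noise * rng.gauss(0.0, 1.0) for c in joint)
        for joint in x0
    ]


# Assumed linear beta schedule with T steps (the paper's schedule may differ).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# Toy part tensors in their local frames (illustrative values).
parts = {
    "body": [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0)],
    "hand": [(0.02, 0.01, 0.01), (0.03, -0.01, 0.02)],
}

rng = random.Random(0)
# Assumption: every part is noised at the same timestep t.
noisy = {name: forward_noise(x0, 500, betas, rng) for name, x0 in parts.items()}
print({name: len(x) for name, x in noisy.items()})  # {'body': 2, 'hand': 2}
```

At inference, the process in Figure 3 runs in the opposite direction: each part starts from Gaussian noise and is refined by its denoising model over $K$ iterations before the parts are reassembled.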
