Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator

Gyeongsik Moon

Abstract

Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose Hand4Whole++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows Hand4Whole++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that Hand4Whole++ substantially improves hand accuracy and enhances overall full-body pose quality.

Paper Structure

This paper contains 15 sections, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Comparison between (b) previous works and (c) the proposed Hand4Whole++. Hand pose estimators [pavlakos2024reconstructing, potamias2024wilor] recover each hand well but fail under interaction due to missing full-body context. Whole-body pose estimators [cai2023smpler] lack hand accuracy due to limited hand diversity in whole-body training data. Naïvely combining both leads to implausible hands, especially under occlusion. In contrast, Hand4Whole++ recovers accurate and plausible hands within full-body context.
  • Figure 2: Overview of Hand4Whole++, which comprises a pre-trained hand pose estimator, a pre-trained whole-body pose estimator, CHAM, and a finger articulation and shape transfer module. During training, only CHAM is updated, while the pre-trained pose estimators remain frozen.
  • Figure 3: Architecture of the proposed CHAM. The gray dashed box (2D positional encoding and cross-attention) is used only when both hands are detected. Otherwise, each hand feature is directly passed to its corresponding branch. For simplicity, we illustrate only three layers instead of the full 24-layer design.
  • Figure 4: Pipeline of the finger and shape transfer. We align the canonical 3D hand mesh to the initial whole-body mesh using the wrist and the four MCP joints (index, middle, ring, and pinky).
  • Figure 5: Effectiveness of the proposed CHAM.
  • ...and 4 more figures
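
The Figure 4 caption describes aligning a canonical hand mesh to the whole-body mesh using five joint correspondences (the wrist and the four MCP joints). The sketch below illustrates one standard way to compute such a rigid alignment, the Kabsch (orthogonal Procrustes) solution via SVD. It is an illustration under assumptions, not the paper's implementation: the function name `rigid_align`, the joint coordinates, and the use of NumPy are all hypothetical, and the paper's differentiable version would presumably run inside an autodiff framework instead.

```python
import numpy as np


def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) such that R @ src_i + t ~= dst_i.

    src, dst: (N, 3) arrays of corresponding 3D points (here, N = 5 joints).
    Uses the Kabsch algorithm; no scaling is estimated.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Hypothetical canonical-hand joints in meters: wrist, then index, middle,
# ring, and pinky MCP joints. These values are made up for illustration.
hand_joints = np.array([
    [0.000, 0.000, 0.000],
    [0.090, 0.030, 0.000],
    [0.095, 0.010, 0.005],
    [0.090, -0.010, 0.000],
    [0.080, -0.030, -0.005],
])

# Synthesize the "whole-body" joints by applying a known rotation (30 degrees
# about the z-axis) and translation, so the recovered transform can be checked.
theta = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.4, 1.1, -0.2])
body_joints = hand_joints @ R_true.T + t_true

R_est, t_est = rigid_align(hand_joints, body_joints)

# The same transform would then be applied to every vertex of the hand mesh;
# here the joints stand in for the mesh vertices.
aligned = hand_joints @ R_est.T + t_est
```

Because the solution is closed-form and built from matrix products and an SVD, the same steps can be expressed in a framework with automatic differentiation, which is one plausible reading of "differentiable rigid alignment" in the abstract.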