
Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation

Rui Hong, Jana Kosecka

Abstract

Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a per-joint token Transformer guided by gesture embeddings as intermediate representations for final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.

Paper Structure (21 sections, 12 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Stage 1: gesture-aware pretraining. HRNet global features $\mathbf{g}$ feed two classification heads. A gradual weighting schedule stabilizes coarse categories first before incorporating fine distinctions.
  • Figure 2: Example gesture images. Top: "fingerspreadnormal" and "five_count" (same label). Bottom: "thumbup_normal" and "thumbup_relaxed" (same coarse, different fine label).
  • Figure 3: Stage 2 pipeline. Multi-scale features ($F_4$, $F_5$) feed the per-joint tokenization module (\ref{sec:tokenization}). Stage 1 gesture logits $(\boldsymbol{\gamma}_{\text{coarse}}, \boldsymbol{\gamma}_{\text{fine}})$ are embedded and injected into the gesture-guided Transformer (\ref{sec:fusiontransformer}). Outputs are decoded to MANO parameters $(\theta, \beta)$ (\ref{sec:mano}).
  • Figure 4: t-SNE of HRNet pooled features on InterHand2.6M test set, colored by coarse gesture label. Gesture pretraining yields more discriminative representations.
  • Figure 5: t-SNE of classifier outputs on InterHand2.6M test set. Coarse and fine logits exhibit different cluster granularities, consistent with the coarse-to-fine supervision design.
  • ...and 1 more figures
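
The gradual weighting schedule mentioned in the Figure 1 caption can be illustrated with a minimal sketch. The captions do not give the actual schedule, so everything here is an assumption made for illustration: the linear ramp, the `warmup_epochs` parameter, and the function names `fine_weight`, `cross_entropy`, and `gesture_loss` are all hypothetical. The sketch only shows the general idea of training the coarse head at full weight from the start while the fine head's loss is phased in.

```python
import math


def fine_weight(epoch, warmup_epochs=10):
    """Hypothetical linear ramp for the fine-label loss weight: 0 -> 1."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, max(0.0, epoch / warmup_epochs))


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def gesture_loss(coarse_logits, fine_logits, coarse_y, fine_y,
                 epoch, warmup_epochs=10):
    """Coarse loss at full weight plus a gradually weighted fine loss.

    Assumed form: L = CE_coarse + w(epoch) * CE_fine, with w from fine_weight.
    """
    w = fine_weight(epoch, warmup_epochs)
    coarse_term = cross_entropy(coarse_logits, coarse_y)
    fine_term = cross_entropy(fine_logits, fine_y)
    return coarse_term + w * fine_term


if __name__ == "__main__":
    coarse = [2.0, 0.5, -1.0]        # toy logits over 3 coarse classes
    fine = [0.1, 1.2, 0.3, -0.4]     # toy logits over 4 fine classes
    for epoch in (0, 5, 10):
        print(epoch, round(gesture_loss(coarse, fine, 0, 1, epoch), 4))
```

Under this assumed schedule only the coarse head contributes at epoch 0, and the fine head reaches full weight once `epoch` equals `warmup_epochs`. A smoother ramp (for example cosine) would fit the same description equally well.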