Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation

Rui Hong; Jana Kosecka

Abstract

Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a per-joint token Transformer guided by gesture embeddings as intermediate representations for final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.

Paper Structure (21 sections, 12 equations, 6 figures, 2 tables)

Introduction
Overview of our approach.
Related Work
Single-Hand 3D Pose Estimation.
Two-Hand Interactions.
Gesture Semantics.
Methodology
Stage 1: Gesture-Aware Pretraining
Gesture label construction.
Stage 2: Gesture-Guided 3D Pose Estimation
Per-Joint Tokenization via 2.5D Volumetric Reasoning
Transformer Module with Gesture Guidance Tokens
MANO Parameter Regression
Training Objective
Experiments
...and 6 more sections

Figures (6)

Figure 1: Stage 1: gesture-aware pretraining. HRNet global features $\mathbf{g}$ feed two classification heads. A gradual weighting schedule stabilizes coarse categories first before incorporating fine distinctions.
Figure 2: Example gesture images. Top: "fingerspreadnormal" and "five_count" (same label). Bottom: "thumbup_normal" and "thumbup_relaxed" (same coarse, different fine label).
Figure 3: Stage 2 pipeline. Multi-scale features ($F_4$, $F_5$) feed the per-joint tokenization module (see "Per-Joint Tokenization via 2.5D Volumetric Reasoning"). Stage 1 gesture logits $(\boldsymbol{\gamma}_{\text{coarse}}, \boldsymbol{\gamma}_{\text{fine}})$ are embedded and injected into the gesture-guided Transformer (see "Transformer Module with Gesture Guidance Tokens"). Outputs are decoded to MANO parameters $(\theta, \beta)$ (see "MANO Parameter Regression").
Figure 4: t-SNE of HRNet pooled features on InterHand2.6M test set, colored by coarse gesture label. Gesture pretraining yields more discriminative representations.
Figure 5: t-SNE of classifier outputs on InterHand2.6M test set. Coarse and fine logits exhibit different cluster granularities, consistent with the coarse-to-fine supervision design.
...and 1 more figures
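
The gradual weighting schedule mentioned in the Figure 1 caption can be illustrated with a small sketch. The captions do not give the schedule's form, so everything here is an assumption made for illustration: the linear ramp, the `warmup_epochs` parameter, the function names, and the use of plain cross-entropy for both heads. The idea shown is only that the coarse-label loss is active from the start while the fine-label loss is phased in.

```python
import math


def fine_weight(epoch, warmup_epochs=10):
    """Weight on the fine-label loss: 0 at epoch 0, 1 after warmup.

    A linear ramp is assumed here; the source does not specify the shape.
    """
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, max(0.0, epoch / warmup_epochs))


def cross_entropy(logits, target):
    """Cross-entropy for one sample, computed from raw logits.

    Uses the log-sum-exp trick so large logits do not overflow.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def pretraining_loss(coarse_logits, fine_logits, coarse_y, fine_y,
                     epoch, warmup_epochs=10):
    """Coarse loss at full weight plus a gradually weighted fine loss."""
    w = fine_weight(epoch, warmup_epochs)
    coarse_term = cross_entropy(coarse_logits, coarse_y)
    fine_term = cross_entropy(fine_logits, fine_y)
    return coarse_term + w * fine_term


# Hypothetical logits for 2 coarse classes and 4 fine classes.
coarse_logits = [0.0, 0.0]
fine_logits = [0.0, 0.0, 0.0, 0.0]
early = pretraining_loss(coarse_logits, fine_logits, 0, 1, epoch=0)
late = pretraining_loss(coarse_logits, fine_logits, 0, 1, epoch=10)
```

With this assumed schedule, `early` contains only the coarse term, while `late` adds the fine term at full weight. Other ramps (cosine, step) would fit the same description equally well.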
