3D Hand Pose Estimation in Everyday Egocentric Images

Aditya Prakash, Ruisen Tu, Matthew Chang, Saurabh Gupta

TL;DR

This paper presents WildHands, a system for 3D hand pose estimation in everyday egocentric images. It outperforms FrankMocap across all metrics and HaMeR on 3 out of 6 metrics while being 10x smaller and trained on 5x less data.

Abstract

3D hand pose estimation in everyday egocentric images is challenging for several reasons: poor visual signal (occlusion from the object of interaction, low resolution, and motion blur), large perspective distortion (hands are close to the camera), and lack of 3D annotations outside of controlled settings. While existing methods often use hand crops as input to focus on fine-grained visual information and cope with poor visual signal, the challenges arising from perspective distortion and lack of 3D annotations in the wild have not been systematically studied. We focus on this gap and explore the impact of different practices, i.e., crops as input, incorporating camera information, auxiliary supervision, and scaling up datasets. We provide several insights that apply to both convolutional and transformer models and lead to better performance. Based on our findings, we also present WildHands, a system for 3D hand pose estimation in everyday egocentric images. Zero-shot evaluation on 4 diverse datasets (H2O, AssemblyHands, Epic-Kitchens, Ego-Exo4D) demonstrates the effectiveness of our approach across 2D and 3D metrics, where we beat past methods by 7.4% to 66%. In system-level comparisons, WildHands achieves the best 3D hand pose on the ARCTIC egocentric split and outperforms FrankMocap across all metrics and HaMeR on 3 out of 6 metrics, while being 10x smaller and trained on 5x less data.
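
To make the role of camera information concrete, the sketch below shows one way the position of a hand crop in the full image could be expressed as viewing-ray angles computed from the camera intrinsics. It assumes an ideal pinhole camera. The function names `pixel_to_ray_angles` and `crop_corner_angles` and the example intrinsics are hypothetical and are not taken from the paper; this is an illustration of the idea, not the encoding WildHands actually uses.

```python
import math

def pixel_to_ray_angles(u, v, fx, fy, cx, cy):
    """Angles (radians) between the optical axis and the viewing ray
    through pixel (u, v), assuming an ideal pinhole camera."""
    theta_x = math.atan2(u - cx, fx)  # horizontal angle
    theta_y = math.atan2(v - cy, fy)  # vertical angle
    return theta_x, theta_y

def crop_corner_angles(box, fx, fy, cx, cy):
    """Ray angles for the four corners of a crop box (x0, y0, x1, y1)
    given in full-image pixel coordinates."""
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    return [pixel_to_ray_angles(u, v, fx, fy, cx, cy) for u, v in corners]

# Hypothetical intrinsics for a 640x480 image (illustrative values only).
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

# Two crops of identical size: one at the image center, one at a corner.
center_crop = (270, 190, 370, 290)
corner_crop = (540, 380, 640, 480)

for name, box in [("center", center_crop), ("corner", corner_crop)]:
    angles = crop_corner_angles(box, fx, fy, cx, cy)
    print(name, [(round(a, 3), round(b, 3)) for a, b in angles])
```

The two crops have the same pixel size, yet their corner angles differ, which is the information a crop-only model loses once the crop is resized to a fixed input resolution.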

Paper Structure (12 sections, 2 equations, 6 figures, 8 tables)


Figures (6)

  • Figure 1: WildHands predicts the 3D shape, 3D articulation and 3D placement of the hand in the camera frame from a single in-the-wild egocentric RGB image and camera intrinsics. It produces better 3D output than FrankMocap [rong2020frankmocap] in occlusion scenarios and is more adept at dealing with perspective distortion than HaMeR [pavlakos2023reconstructing] in challenging egocentric hand-object interactions from the Epic-Kitchens [damen2018scaling] dataset.
  • Figure 2: Model Overview. We crop the input images around the hand and process them using a convolutional backbone. The hand features, along with the global image features (not shown above for clarity) and intrinsics-aware positional encoding (KPE [Prakash2023Ambiguity]) for each crop, are fed to the decoder to predict the 3D hand. The hand decoders predict MANO parameters $\beta, \theta_\text{local}, \theta_\text{global}$ and camera translation, which are converted to 3D keypoints & 2D keypoints and trained using 3D supervision on lab datasets, e.g. ARCTIC [Fan2023CVPR], AssemblyHands [ohkawa2023assemblyhands]. We also use auxiliary supervision from the in-the-wild Epic-Kitchens [darkhalil2022visor] dataset via hand segmentation masks and grasp labels. The hand masks are available with the VISOR dataset [darkhalil2022visor], whereas grasp labels are estimated using an off-the-shelf model from [cheng2023towards].
  • Figure 3: Epic-HandKps annotations. We collect 2D joint annotations (shown in blue) for 5K in-the-wild egocentric images from Epic-Kitchens [damen2020collection]. We show a few annotations here with images cropped around the hand. We also have the label for the joint corresponding to each keypoint. Note the heavy occlusion & large variation in dexterous poses of hands interacting with objects. More visualizations in supplementary.
  • Figure 4: Visualizations. We show the projection of the predicted hand in the image & renderings of the hand mesh from 2 more views. WildHands predicts better hand poses from a single image than FrankMocap [rong2020frankmocap], HaMeR [pavlakos2023reconstructing] and ArcticNet [Fan2023CVPR] in challenging egocentric scenarios involving occlusions and perspective distortion.
  • Figure 5: Failure cases. We observe that images with (top) barely visible fingers, e.g. kneading dough, or (bottom) extreme grasp poses are challenging for all models.
  • ...and 1 more figures
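
The Figure 2 caption mentions that the predicted hand and camera translation are converted to 3D and 2D keypoints. A minimal sketch of that last step is given below, assuming a standard pinhole projection. The MANO forward pass is not reproduced here: root-relative 3D joints are taken as given, and the function name `project_keypoints`, the joint values, and the intrinsics are all hypothetical illustrations rather than the paper's implementation.

```python
def project_keypoints(joints_rel, translation, fx, fy, cx, cy):
    """Place root-relative 3D joints in the camera frame and project
    them to pixels with a pinhole model (an assumed, simplified setup).

    joints_rel  : list of (x, y, z) in metres, relative to the hand root
    translation : (tx, ty, tz) of the hand root in the camera frame
    """
    tx, ty, tz = translation
    joints_cam, joints_2d = [], []
    for x, y, z in joints_rel:
        X, Y, Z = x + tx, y + ty, z + tz
        if Z <= 0:
            raise ValueError("joint must lie in front of the camera")
        joints_cam.append((X, Y, Z))
        # Perspective division followed by the intrinsics.
        joints_2d.append((fx * X / Z + cx, fy * Y / Z + cy))
    return joints_cam, joints_2d

# Hypothetical example: a root joint and one joint 5 cm to its right,
# with the hand 0.5 m in front of the camera.
joints_rel = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0)]
translation = (0.1, -0.05, 0.5)
joints_cam, joints_2d = project_keypoints(
    joints_rel, translation, fx=500.0, fy=500.0, cx=320.0, cy=240.0
)
print(joints_cam)
print(joints_2d)
```

Because both outputs come from the same parameters, a loss on the 2D keypoints and a loss on the 3D keypoints would constrain the same prediction, which is consistent with the caption's description of supervising both.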