Reconstructing Hands in 3D with Transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, Jitendra Malik

TL;DR

HaMeR presents a transformer-based pipeline for monocular 3D hand mesh reconstruction that regresses MANO parameters and camera pose from RGB images. The approach hinges on large-scale data and a high-capacity ViT-H architecture, achieving state-of-the-art results on FreiHAND and HO3D, and demonstrating strong robustness in real-world, in-the-wild conditions via the new HInt dataset. By combining 3D supervision, 2D reprojection, and adversarial losses, HaMeR delivers precise, temporally stable hand reconstructions across occlusions and interactions. The introduction of HInt and the release of code and models aim to catalyze broader adoption and evaluation in robotics, action understanding, and sign-language research.

Abstract

We present an approach that can reconstruct hands in 3D from monocular input. Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work. The key to HaMeR's success lies in scaling up both the data used for training and the capacity of the deep network for hand reconstruction. For training data, we combine multiple datasets that contain 2D or 3D hand annotations. For the deep model, we use a large scale Vision Transformer architecture. Our final model consistently outperforms the previous baselines on popular 3D hand pose benchmarks. To further evaluate the effect of our design in non-controlled settings, we annotate existing in-the-wild datasets with 2D hand keypoint annotations. On this newly collected dataset of annotations, HInt, we demonstrate significant improvements over existing baselines. We make our code, data and models available on the project website: https://geopavlakos.github.io/hamer/.
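
As a rough illustration of how training on a mix of 2D-only and 3D-annotated data could fit together with the three loss terms named in the TL;DR (3D supervision, 2D reprojection, adversarial), the sketch below combines them into one per-sample objective. It is not the authors' implementation: the function names, the L1 keypoint error, the least-squares form of the adversarial term, the `has_3d` flag, the `annotated_2d` mask, and all weights are assumptions made for illustration.

```python
import numpy as np

NUM_JOINTS = 21  # hand keypoints per annotation, as described in Figure 2


def keypoint_l1(pred, target, mask):
    """Mean L1 error over the keypoints selected by a 0/1 mask (assumed metric)."""
    per_joint = np.abs(pred - target).sum(axis=-1)  # shape: (NUM_JOINTS,)
    denom = max(float(mask.sum()), 1.0)             # avoid division by zero
    return float((per_joint * mask).sum() / denom)


def combined_loss(pred_3d, pred_2d, gt_3d, gt_2d, has_3d, annotated_2d,
                  disc_score, w_3d=1.0, w_2d=1.0, w_adv=0.1):
    """Hypothetical per-sample objective; weights and norms are illustrative only.

    has_3d       -- False for samples that carry only 2D annotations
    annotated_2d -- 0/1 mask of keypoints that have a 2D label
    disc_score   -- scalar discriminator output for the predicted hand parameters
    """
    all_joints = np.ones(NUM_JOINTS)
    # 3D supervision applies only when the sample has 3D ground truth.
    loss_3d = keypoint_l1(pred_3d, gt_3d, all_joints) if has_3d else 0.0
    # 2D reprojection error on the labelled keypoints.
    loss_2d = keypoint_l1(pred_2d, gt_2d, annotated_2d)
    # Least-squares adversarial term: push the discriminator score toward 1.
    loss_adv = float((disc_score - 1.0) ** 2)
    return w_3d * loss_3d + w_2d * loss_2d + w_adv * loss_adv
```

With this layout, a sample from a 2D-only dataset still contributes through the reprojection and adversarial terms, while its 3D term is switched off by `has_3d=False`.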

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Monocular 3D hand mesh reconstruction. We propose HaMeR, a fully transformer-based approach for Hand Mesh Recovery. HaMeR achieves consistent improvements upon the state-of-the-art for 3D hand reconstruction. We can faithfully reconstruct hands in a wide variety of scenarios, including captures from different viewpoints (third person or egocentric), hands under occlusion, hands that interact with objects or other hands, hands with different skin tones, gloved hands, hands in art paintings, and mechanical hands. We encourage the reader to watch our reconstructions in the Supplemental Video to appreciate the temporal stability.
  • Figure 2: Dataset and Architecture. (Top) Hand crops with keypoint annotations from our HInt dataset of annotations for different image sources: Hands23 [cheng2023towards], Epic-Kitchens [VISOR2022, damen2018scaling], and Ego4D [grauman2022ego4d]. We provide location annotations for 21 hand keypoints as well as an "occlusion" label for each joint. Occluded keypoints are marked with solid dots filled with black, while non-occluded ones are filled with white. The pie chart shows the distribution and statistics of our dataset. (Bottom) The architecture for HaMeR follows a fully transformer-based design. We use a large-scale ViT backbone [dosovitskiy2020image] followed by a transformer decoder to regress the parameters of the hand.
  • Figure 3: Qualitative comparison. We compare our approach qualitatively with state-of-the-art methods for hand mesh reconstruction. The previous baselines include METRO [lin2021end], Mesh Graphormer [lin2021mesh], and FrankMocap [rong2021frankmocap]. METRO and Mesh Graphormer are non-parametric methods (regressing MANO vertices directly), while FrankMocap and HaMeR (ours) are parametric methods (regressing MANO parameters). The reconstructions from HaMeR are consistently better, particularly on more challenging examples, e.g., cases with motion blur, or images with hand-hand or hand-object interaction. We encourage the reader to also watch the Supplemental Video for more comparisons over time.
  • Figure 4: Qualitative results. We present qualitative results of our approach on the test set of HInt. We include images from New Days (rows 1-2), VISOR (rows 3-4), Ego4D (rows 5-6), as well as various Internet images (rows 7-8). HaMeR is particularly robust and can gracefully handle cases with heavy occlusion and interactions with objects or other hands.