
WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild

Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, Stefanos Zafeiriou

TL;DR

WiLoR tackles in-the-wild multi-hand localization and 3D reconstruction by coupling a real-time FCN hand detector with a transformer-based 3D hand pose estimator that includes a refinement module for image-aligned features. A key contribution is the WHIM dataset, a large-scale in-the-wild corpus of over 2M hand images with 2D/3D annotations and biomechanical and 3D priors to enable robust learning. Empirically, WiLoR achieves state-of-the-art results on FreiHAND and HO3D, records real-time speeds (>130 FPS), and exhibits improved temporal coherence for monocular video without temporal modeling. The work delivers a practical, end-to-end solution for multi-hand detection, localization, and 3D reconstruction with potential impact on AR/VR and robotics.

Abstract

In recent years, 3D hand pose estimation methods have garnered significant attention due to their extensive applications in human-computer interaction, virtual reality, and robotics. In contrast, there has been a notable gap in hand detection pipelines, posing significant challenges in constructing effective real-world multi-hand reconstruction systems. In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization model and a high-fidelity transformer-based 3D hand reconstruction model. To tackle the limitations of previous methods and build a robust and stable detection network, we introduce a large-scale dataset with more than 2M in-the-wild hand images with diverse lighting, illumination, and occlusion conditions. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks. Finally, we showcase the effectiveness of our pipeline in achieving smooth 3D hand tracking from monocular videos, without utilizing any temporal components. Code, models, and dataset are available at https://rolpotamias.github.io/WiLoR.

Paper Structure (18 sections, 8 equations, 7 figures, 7 tables)


Figures (7)

  • Figure 1: We propose WiLoR, a full-stack in-the-Wild Localization and 3D hand Reconstruction method. WiLoR first localizes the hands in the image and determines their handedness; the detected hands are then lifted to 3D using a transformer-based hand pose estimation module. To aid high-fidelity reconstructions and facilitate image alignment, we introduce a refinement module that extracts localized features to correct misaligned poses. WiLoR achieves state-of-the-art performance across different benchmark datasets while boosting the temporal coherence of image-based 3D hand pose estimation methods.
  • Figure 2: Example of the proposed WHIM in-the-wild dataset.
  • Figure 3: Detection overview: The proposed fully convolutional one-stage hand detection method receives an image and extracts multi-resolution feature maps that are then processed by the Path Aggregation Network (PANet). The corresponding features are then fed to three detection heads that predict the hand side, bounding box, and hand joints at different resolutions. We train the network with a multi-task loss for each anchor.
  • Figure 4: Overview of the proposed 3D hand pose estimation method: Given an image $\mathbf{I}_h$ represented as a series of feature tokens $\mathbf{T}_{img}$, along with a set of learnable camera $\mathbf{T}_{cam}$, pose $\mathbf{T}_{pose}$, and shape $\mathbf{T}_{shape}$ tokens, we first predict a rough estimate of the MANO pose and shape parameters $\theta, \beta$ and the camera parameters $\mathbf{K}_{cam}$ using a ViT backbone (light blue). The updated image tokens are then reshaped and upsampled through a series of deconvolutional layers to form a set of multi-resolution feature maps $\{\mathbf{F}_{0},...,\mathbf{F}_{n}\}$. We then project the estimated 3D hand onto the generated feature maps and sample image-aligned multi-scale features through a novel refinement module (purple). The sampled features are used to predict pose and shape residuals $\Delta\theta, \Delta\beta$ that refine the coarse hand estimation. This coarse-to-fine pose estimation strategy facilitates image alignment and achieves better reconstruction performance.
  • Figure 5: Qualitative Evaluation of the proposed hand detection network on in-the-wild images. The proposed model demonstrates robustness across various lighting conditions, resolutions, hand scales, and even in the presence of motion blur.
  • ...and 2 more figures
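
The feature-sampling step described in the Figure 4 caption (project the coarse 3D hand onto the multi-resolution feature maps, then gather image-aligned features) can be illustrated with a minimal sketch. This is not the authors' implementation. The weak-perspective camera `(scale, tx, ty)`, the normalized `[-1, 1]` coordinate convention, the nested-list feature-map layout, and all function names are assumptions made for illustration. The regressor that turns the sampled features into the residuals $\Delta\theta, \Delta\beta$ is omitted.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
FeatureMap = List[List[List[float]]]  # assumed layout: [channel][row][col]


def project_weak_perspective(
    points: Sequence[Point3D], scale: float, tx: float, ty: float
) -> List[Tuple[float, float]]:
    """Project 3D points to normalized image coordinates in [-1, 1].

    Assumption: a weak-perspective camera; the paper's camera model may differ.
    """
    return [(scale * x + tx, scale * y + ty) for x, y, _ in points]


def bilinear_sample(fmap: FeatureMap, u: float, v: float) -> List[float]:
    """Bilinearly sample every channel of one feature map at (u, v)."""
    height, width = len(fmap[0]), len(fmap[0][0])
    # Map [-1, 1] to pixel coordinates and clamp to the map borders.
    px = min(max((u + 1.0) * 0.5 * (width - 1), 0.0), width - 1.0)
    py = min(max((v + 1.0) * 0.5 * (height - 1), 0.0), height - 1.0)
    x0, y0 = int(px), int(py)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = px - x0, py - y0
    sampled = []
    for channel in fmap:
        top = channel[y0][x0] * (1 - wx) + channel[y0][x1] * wx
        bottom = channel[y1][x0] * (1 - wx) + channel[y1][x1] * wx
        sampled.append(top * (1 - wy) + bottom * wy)
    return sampled


def sample_image_aligned_features(
    points: Sequence[Point3D],
    cam: Tuple[float, float, float],
    feature_maps: Sequence[FeatureMap],
) -> List[List[float]]:
    """For each 3D point, concatenate features sampled from every resolution."""
    scale, tx, ty = cam
    features = []
    for u, v in project_weak_perspective(points, scale, tx, ty):
        per_point: List[float] = []
        for fmap in feature_maps:
            per_point.extend(bilinear_sample(fmap, u, v))
        features.append(per_point)
    return features


if __name__ == "__main__":
    coarse_map = [[[5.0]]]                   # 1 channel, 1x1 resolution
    fine_map = [[[0.0, 1.0], [2.0, 3.0]]]    # 1 channel, 2x2 resolution
    hand_points = [(0.0, 0.0, 1.0)]          # a single point at the image center
    print(sample_image_aligned_features(hand_points, (1.0, 0.0, 0.0), [fine_map, coarse_map]))
```

Because each point is sampled at the same normalized location in every map, the concatenated vector mixes fine and coarse context for that point, which is the property the caption refers to as image-aligned multi-scale features.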