HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning

Eugene Valassakis, Guillermo Garcia-Hernando

TL;DR

HandDGP addresses the challenge of predicting camera-space hand meshes from a single RGB image by unifying root-relative hand mesh prediction with a differentiable global positioning (DGP) module and an input image rectification step. The DGP solves for the hand's global translation in camera space via a differentiable Direct Linear Transform on 2D-3D keypoint correspondences, allowing gradients to flow through the root-finding process and enabling end-to-end training. The framework jointly optimizes root-relative predictions and camera-space coordinates, with a weighted keypoint scheme and rectified intrinsics to mitigate depth and scale ambiguity. Empirical results across FreiHAND, HO3D-v2, and Human3.6M show state-of-the-art camera-space accuracy, highlighting the value of end-to-end learning, differentiable geometry, and geometry-based data canonicalization for realistic 3D hand interactions.
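
To make the translation solve concrete, here is a minimal NumPy sketch of a weighted linear least-squares fit of the root translation from 2D-3D keypoint correspondences. It is an illustration under stated assumptions, not the authors' implementation: the function name `solve_translation`, the zero-skew pinhole model, and the way weights are applied per keypoint are all hypothetical choices. Each keypoint contributes two linear equations in the unknown translation, obtained by multiplying the projection equations through by the depth.

```python
import numpy as np


def solve_translation(xyz_rel, uv, K, weights=None):
    """Weighted least-squares root translation (hypothetical helper).

    xyz_rel : (N, 3) root-relative 3D keypoints.
    uv      : (N, 2) 2D keypoints in pixels.
    K       : (3, 3) pinhole intrinsics, assumed to have zero skew.
    weights : optional (N,) non-negative per-keypoint weights.
    """
    xyz_rel = np.asarray(xyz_rel, dtype=np.float64)
    uv = np.asarray(uv, dtype=np.float64)
    n = xyz_rel.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=np.float64)

    # Normalised image coordinates: x_n = (u - cx) / fx, y_n = (v - cy) / fy.
    x_n = (uv[:, 0] - K[0, 2]) / K[0, 0]
    y_n = (uv[:, 1] - K[1, 2]) / K[1, 1]

    # From x_n * (Z + tz) = X + tx and y_n * (Z + tz) = Y + ty:
    #   tx - x_n * tz = x_n * Z - X
    #   ty - y_n * tz = y_n * Z - Y
    A = np.zeros((2 * n, 3))
    A[0::2, 0] = 1.0
    A[0::2, 2] = -x_n
    A[1::2, 1] = 1.0
    A[1::2, 2] = -y_n
    b = np.empty(2 * n)
    b[0::2] = x_n * xyz_rel[:, 2] - xyz_rel[:, 0]
    b[1::2] = y_n * xyz_rel[:, 2] - xyz_rel[:, 1]

    # Weighted normal equations: (A^T W A) t = A^T W b.
    w2 = np.repeat(w, 2)
    AtWA = A.T @ (w2[:, None] * A)
    AtWb = A.T @ (w2 * b)
    return np.linalg.solve(AtWA, AtWb)


# Synthetic check: project known camera-space points, then recover t.
rng = np.random.default_rng(0)
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
xyz_rel = rng.uniform(-0.1, 0.1, size=(21, 3))
t_true = np.array([0.05, -0.02, 0.6])
cam = xyz_rel + t_true
uv = cam[:, :2] / cam[:, 2:3] * np.array([600.0, 600.0]) + np.array([320.0, 240.0])
t_est = solve_translation(xyz_rel, uv, K)
```

Because the solve reduces to a 3x3 linear system, the same algebra written in an automatic-differentiation framework would let gradients pass from the recovered translation back to the predicted keypoints and weights, which is the property the summary above attributes to the DGP module.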

Abstract

Predicting camera-space hand meshes from single RGB images is crucial for enabling realistic hand interactions in 3D virtual and augmented worlds. Previous work typically divided the task into two stages: given a cropped image of the hand, predict meshes in relative coordinates, followed by lifting these predictions into camera space in a separate and independent stage, often resulting in the loss of valuable contextual and scale information. To prevent the loss of these cues, we propose unifying these two stages into an end-to-end solution that addresses the 2D-3D correspondence problem. This solution enables back-propagation from camera space outputs to the rest of the network through a new differentiable global positioning module. We also introduce an image rectification step that harmonizes both the training dataset and the input image as if they were acquired with the same camera, helping to alleviate the inherent scale-depth ambiguity of the problem. We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches across three public benchmarks.
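
One way to picture the rectification idea is as a remapping of pixel coordinates from the capturing camera's intrinsics to a shared reference camera. The sketch below assumes a pure change of intrinsics with no rotation, which gives the homography `K_ref @ inv(K_src)`. The function names, the reference intrinsics, and this specific formulation are assumptions made for illustration; the abstract does not spell out the exact procedure.

```python
import numpy as np


def rectification_homography(K_src, K_ref):
    """Homography taking pixels of the source camera to the reference camera.

    Assumes both cameras share the same centre and orientation, so only
    the intrinsics differ (a hypothetical simplification).
    """
    return K_ref @ np.linalg.inv(K_src)


def rectify_pixels(uv, K_src, K_ref):
    """Map (N, 2) pixel coordinates as if they were seen by the reference camera."""
    uv = np.asarray(uv, dtype=np.float64)
    H = rectification_homography(K_src, K_ref)
    uv_h = np.hstack([uv, np.ones((uv.shape[0], 1))])  # homogeneous pixels
    out = uv_h @ H.T
    return out[:, :2] / out[:, 2:3]


# A 3D point seen by a source camera, then rectified to reference intrinsics.
K_src = np.array([[1200.0, 0.0, 640.0], [0.0, 1200.0, 360.0], [0.0, 0.0, 1.0]])
K_ref = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
point = np.array([0.1, -0.05, 0.7])
uv_src = (K_src @ point)[:2] / point[2]
uv_rect = rectify_pixels(uv_src[None, :], K_src, K_ref)[0]
```

After this mapping, the same 3D point lands on the same pixel regardless of which camera captured it, so apparent hand size in the image depends on depth alone rather than on a mix of depth and focal length. An image would be resampled with the same homography.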

Paper Structure (15 sections, 7 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Method overview. Through our Differentiable Global Positioning (DGP) module, which predicts the root translation of the hand, our method can back-propagate through the root-finding operation, enabling an end-to-end solution.
  • Figure 2: HandDGP Framework Overview. Rectified images are passed through our framework, which predicts camera-space coordinates using our proposed DGP module.
  • Figure 3: Weight decoder head.
  • Figure 4: Keypoint selection. Effect of keypoint selection with our weight decoder. Test-set images on FreiHAND and HO3D-v2 with the 2D keypoints overlaid: The brighter the keypoint, the higher the weight.
  • Figure 5: (a) 3D PCK for camera-space hand mesh prediction on FreiHAND. (b) Camera-space hand mesh predictions, rotated for illustration purposes. All meshes project correctly onto the image, but some predictions display a 3D error offset. (c) Root-relative vs. camera-space errors. Selected FreiHAND images with average camera-space (CS) and root-relative (RS) errors and ground truth (mesh in white).
  • ...and 1 more figures