Get a Grip: Reconstructing Hand-Object Stable Grasps in Egocentric Videos

Zhifan Zhu, Dima Damen

TL;DR

This work defines Hand-Object Stable Grasp Reconstruction (HO-SGR): reconstructing hand–object poses over stable grasp intervals in egocentric video. It shows that, within a grasp, the object's motion relative to the hand is typically a 1-DoF rotation around a latent axis. The proposed render-and-compare optimisation jointly estimates a global object pose and per-frame rotations around that latent axis, constrained by three terms (mask, push, pull) and guided by a differentiable projection, with hand poses supplied by HaMeR. A new in-the-wild dataset, EPIC-Grasps, provides 2.4K stable grasp clips spanning 9 object categories and 141 environments, with 2D masks serving as pseudo-ground truth so that evaluation does not require full 3D ground truth. Across ARCTIC-Grasps and EPIC-Grasps, the 1-DoF approach yields consistently higher stable contact preservation and pose plausibility than baselines, though limitations remain for CAD-agnostic methods and severe occlusions. The work paves the way for quantitative, in-the-wild evaluation of hand–object reconstruction.

Abstract

We propose the task of Hand-Object Stable Grasp Reconstruction (HO-SGR), the reconstruction of frames during which the hand is stably holding the object. We first develop the stable grasp definition based on the intuition that the in-contact area between the hand and object should remain stable. By analysing the 3D ARCTIC dataset, we identify stable grasp durations and show that objects in stable grasps move within a single degree of freedom (1-DoF). We thereby propose a method to jointly optimise all frames within a stable grasp, restricting object motion to a latent 1-DoF. Finally, we extend the knowledge to in-the-wild videos by labelling 2.4K clips of stable grasps. Our proposed EPIC-Grasps dataset includes 390 object instances of 9 categories, featuring stable grasps from videos of daily interactions in 141 environments. Without 3D ground truth, we use stable contact areas and 2D projection masks to assess the HO-SGR task in the wild. We evaluate relevant methods, and our approach preserves a significantly higher stable contact area on both EPIC-Grasps and stable grasp sub-sequences from the ARCTIC dataset.
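The 1-DoF observation in the abstract can be illustrated with a small numerical sketch. It is not the authors' analysis code: the function names, the choice of the first frame as reference, and the SVD-based axis fit are all assumptions made for illustration. Given per-frame object-to-hand rotations, it fits one shared axis and compares the error of a static approximation (the pose never changes) against a 1-DoF approximation (rotation about the fitted axis only).

```python
import numpy as np

def rotvec_to_matrix(rv):
    """Rodrigues' formula: rotation vector (axis * angle, radians) to 3x3 matrix."""
    angle = np.linalg.norm(rv)
    if angle < 1e-12:
        return np.eye(3)
    x, y, z = rv / angle
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def matrix_to_rotvec(R):
    """Inverse of the above; valid for rotation angles below pi."""
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    sin_angle = 0.5 * np.linalg.norm(w)
    cos_angle = 0.5 * (np.trace(R) - 1.0)
    if sin_angle < 1e-12:
        return np.zeros(3)
    return np.arctan2(sin_angle, cos_angle) * w / (2.0 * sin_angle)

def geodesic_deg(Ra, Rb):
    """Angle in degrees of the rotation taking Rb to Ra."""
    return np.degrees(np.linalg.norm(matrix_to_rotvec(Ra @ Rb.T)))

def fit_one_dof(rotations):
    """Fit a shared axis (assumed procedure, not the paper's).

    Rotations are expressed relative to the first frame; the dominant
    direction of the resulting rotation vectors is taken as the axis.
    """
    R_ref = rotations[0]
    rvs = np.stack([matrix_to_rotvec(R @ R_ref.T) for R in rotations])
    _, _, vt = np.linalg.svd(rvs, full_matrices=False)
    axis = vt[0]                      # unit vector, sign is arbitrary
    angles = rvs @ axis               # signed per-frame angle about the axis
    approx = np.stack([rotvec_to_matrix(a * axis) @ R_ref for a in angles])
    return axis, angles, approx

def approximation_errors(rotations):
    """Per-frame rotation error (degrees) of the static and 1-DoF approximations."""
    _, _, approx = fit_one_dof(rotations)
    static = np.array([geodesic_deg(R, rotations[0]) for R in rotations])
    one_dof = np.array([geodesic_deg(R, A) for R, A in zip(rotations, approx)])
    return static, one_dof
```

For a sequence that truly rotates about a single axis, the 1-DoF error is numerically zero while the static error grows with the rotation angle. Real sequences would show a non-zero residual for both.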

Paper Structure (21 sections, 8 equations, 13 figures, 9 tables)

Figures (13)

  • Figure 1: Two stable grasp sequences from EPIC-Grasps for a bottle (left) and bowl (right). We show sample frames (top) and reconstructions (bottom). Right: for each reconstruction, we show the rotated view, along with the latent 1-DoF axis.
  • Figure 2: Sample hand-object mesh sequence from ARCTIC. Contact areas (in shiny yellow) are similar within the stable grasp (blue background). At -0.16s the hand has no contact with the object.
  • Figure 3: To study the object's relative pose, we align the hand coordinate systems (right). Top: Made-up sequence with static relative pose -- object is perfectly aligned (mixture of 3 colours). Bottom: True sequence, showing that the object's motion relative to the hand can be approximated as a 1-DoF rotation around axis $\phi$, shown in purple.
  • Figure 4: We compare within/outside grasps, analysing object in-contact area (left) and corresponding rotation errors of the static and 1-DoF rotation approximations (right), normalising all stable grasp durations for direct comparison (0 to 1 marked with blue background). While both the Static and 1-DoF assumptions result in low approximation error within stable grasps, the error of the 1-DoF assumption is marginal (right).
  • Figure 5: Our proposed reconstruction method. We show 3 frames within a stable grasp. HaMeR (Pavlakos et al., 2024) produces the hand meshes (rendered in blue) from RGB, and we initially set the object-to-hand pose $T_{o2h}^{n}$ of every frame to the same $T_{o2h}$. Then, during each iteration of the optimisation, the object's relative pose is constrained to 1-DoF and projected back to the individual frames. The projections are compared with the ground truth segmentation (right), and all frames are optimised jointly. We ignore the mask computation in hand-occluded regions (grey in the right figure). The physical terms are omitted in this figure.
  • ...and 8 more figures
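
The Figure 5 caption notes that the mask comparison skips hand-occluded regions. A minimal sketch of that idea follows, with several assumptions: the function names are hypothetical, the loss is taken as 1 minus IoU, and plain NumPy on binary masks stands in for the differentiable rendering the method relies on. It is meant only to show how an ignore region changes the comparison, not to reproduce the paper's mask term.

```python
import numpy as np

def masked_iou(rendered, target, ignore):
    """IoU between two binary masks, skipping pixels flagged in `ignore`.

    `ignore` marks pixels where the hand occludes the object, so a mismatch
    there should not be penalised.
    """
    valid = ~ignore.astype(bool)
    r = rendered.astype(bool) & valid
    t = target.astype(bool) & valid
    union = np.logical_or(r, t).sum()
    if union == 0:
        return 1.0  # nothing visible to compare: treat as a perfect match
    return float(np.logical_and(r, t).sum() / union)

def mean_mask_loss(rendered_seq, target_seq, ignore_seq):
    """Average (1 - IoU) over all frames of one stable grasp (assumed loss form)."""
    losses = [
        1.0 - masked_iou(r, t, g)
        for r, t, g in zip(rendered_seq, target_seq, ignore_seq)
    ]
    return float(np.mean(losses))

# Toy 4x4 example: the target mask is twice as wide as the rendered one,
# but the extra half lies under the hand and is therefore ignored.
rendered = np.zeros((4, 4), dtype=bool)
rendered[0:2, 0:2] = True
target = np.zeros((4, 4), dtype=bool)
target[0:2, 0:4] = True
hand = np.zeros((4, 4), dtype=bool)
hand[0:2, 2:4] = True
no_hand = np.zeros((4, 4), dtype=bool)
```

In the toy example, `masked_iou(rendered, target, hand)` scores a perfect match, whereas passing `no_hand` as the ignore region penalises the hidden half. Averaging over frames mirrors the caption's point that all frames of a grasp are compared jointly.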