Reconstructing Hand-Held Objects in 3D from Images and Videos

Jane Wu, Georgios Pavlakos, Georgia Gkioxari, Jitendra Malik

TL;DR

This work reconstructs hand-held objects in 3D from monocular RGB data by coupling a hand-grounded implicit representation with retrieval-based object augmentation. It introduces MCC-HO, a transformer-based model that jointly infers hand and object geometry as a neural implicit surface ρ(x) = (σ(x), c(x), m(x)), conditioned on an RGB image and an estimated 3D hand. It also introduces Retrieval-Augmented Reconstruction (RAR), which uses GPT-4(V) and a text-to-3D model to retrieve a matching object and align it to the scene. A rigid, temporally consistent alignment across frames is obtained from discretized SO(3) rotations, translations, and a Viterbi-based global optimization that combines Chamfer Distance and DINOv2 similarity. Experiments on DexYCB, MOW, HOI4D, and 100DOH show state-of-the-art performance and demonstrate that RAR scales to creating large hand-object 3D datasets, offering a practical data engine for robotics and imitation learning. The approach highlights the value of fusing strong hand priors with modern vision-language and generative models to overcome data scarcity in hand-object 3D reconstruction.
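To make the implicit representation ρ(x) = (σ(x), c(x), m(x)) more concrete, the sketch below shows one way per-point predictions could be turned into separate hand and object point clouds. It is an illustration only: it assumes σ is an occupancy probability, c an RGB color, and m a three-way background/hand/object score, and the function name, label ids, and threshold are hypothetical rather than taken from the MCC-HO code.

```python
import numpy as np

# Hypothetical label ids (assumption): 0 = background, 1 = hand, 2 = object.
HAND, OBJECT = 1, 2

def decode_points(queries, sigma, color, label_scores, threshold=0.5):
    """Split occupied query points into hand and object point clouds.

    queries:      (N, 3) 3D query locations x
    sigma:        (N,)   assumed occupancy probability sigma(x) in [0, 1]
    color:        (N, 3) assumed RGB prediction c(x)
    label_scores: (N, 3) assumed per-class scores for m(x)
    """
    occupied = sigma > threshold
    labels = label_scores.argmax(axis=1)
    clouds = {}
    for name, idx in (("hand", HAND), ("object", OBJECT)):
        keep = occupied & (labels == idx)
        clouds[name] = {"xyz": queries[keep], "rgb": color[keep]}
    return clouds

# Toy usage with hand-written predictions for four query points.
queries = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
sigma = np.array([0.9, 0.8, 0.2, 0.7])
color = np.full((4, 3), 0.5)
label_scores = np.array([[0.1, 0.8, 0.1],   # hand
                         [0.1, 0.2, 0.7],   # object
                         [0.1, 0.1, 0.8],   # object, but below the threshold
                         [0.9, 0.05, 0.05]])  # background
clouds = decode_points(queries, sigma, color, label_scores)
```

In this toy input, one point survives as hand and one as object; the third point is dropped because its occupancy is below the threshold, and the fourth because it is labeled background.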

Abstract

Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from Internet videos. Not only does the hand occlude much of the object, but also the object is often only visible in a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the object, and (2) the set of manipulanda is small relative to all possible objects. With these insights in mind, we present a scalable paradigm for hand-held object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Given a monocular RGB video, we aim to reconstruct hand-held object geometry in 3D, over time. In order to obtain the best performing single frame model, we first present MCC-Hand-Object (MCC-HO), which jointly reconstructs hand and object geometry given a single RGB image and inferred 3D hand as inputs. Subsequently, we prompt a text-to-3D generative model using GPT-4(V) to retrieve a 3D object model that matches the object in the image(s); we call this alignment Retrieval-Augmented Reconstruction (RAR). RAR provides unified object geometry across all frames, and the result is rigidly aligned with both the input images and 3D MCC-HO observations in a temporally consistent manner. Experiments demonstrate that our approach achieves state-of-the-art performance on lab and Internet image/video datasets. We make our code and models available on the project website: https://janehwu.github.io/mcc-ho
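The temporally consistent rigid alignment can be illustrated with a generic Viterbi pass over a discrete set of candidate rotations. The sketch below is a minimal version under stated assumptions, not the authors' implementation: the per-frame cost is only a symmetric Chamfer distance between the rotated template and the observed point cloud (the image-feature term and translations are omitted), the transition cost is a hypothetical geodesic-angle penalty, and all function names and the `smooth_weight` parameter are made up for illustration.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def geodesic_angle(r1, r2):
    """Angle in radians of the relative rotation between two 3x3 rotations."""
    cos = (np.trace(r1.T @ r2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def viterbi(unary, pairwise):
    """Minimum-cost state sequence.

    unary:    (T, K) cost of candidate k at frame t
    pairwise: (K, K) cost of moving from candidate i to candidate j
    """
    T, K = unary.shape
    cost = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + pairwise      # rows: previous, cols: next
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0) + unary[t]
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):             # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def rot_z(theta):
    """Rotation about the z axis, used here to build a tiny candidate set."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def align_sequence(template, observations, rotations, smooth_weight=1.0):
    """Pick one candidate rotation index per frame."""
    unary = np.array([[chamfer_distance(template @ r.T, obs) for r in rotations]
                      for obs in observations])
    pairwise = smooth_weight * np.array(
        [[geodesic_angle(ri, rj) for rj in rotations] for ri in rotations])
    return viterbi(unary, pairwise)

# Toy usage: an asymmetric template observed under known rotations.
template = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                     [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]])
rotations = [rot_z(k * np.pi / 2) for k in range(4)]
observations = [template @ rotations[k].T for k in (1, 1, 2)]
path = align_sequence(template, observations, rotations, smooth_weight=0.01)
```

With a small `smooth_weight` the recovered indices follow the per-frame best match; raising it trades per-frame fit for smoother rotation trajectories, which is the usual reason to prefer a global pass over independent per-frame choices.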

Paper Structure (16 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: We present a scalable approach to hand-held object reconstruction from monocular RGB images or videos that is guided by object recognition and retrieval. Results demonstrate that our method is able to generate realistic object geometry that is also faithful to visual observations and consistent across frames. Please see the supplementary materials for evaluation of temporal consistency.
  • Figure 2: Given an RGB video and estimated 3D hands, our method reconstructs 3D hand-held object trajectories. First, MCC-HO is used to predict hand and object point clouds for each frame (MCC-HO section). Then, a single 3D model for the object is obtained using Retrieval-Augmented Reconstruction (RAR section). The 3D object model is rigidly aligned with DINOv2 (Oquab et al., 2023) image features and network-inferred geometry in a temporally consistent manner via our Viterbi algorithm (temporal alignment section).
  • Figure 3: We compute a PCA basis of DINOv2 features using all frames masked by the object silhouettes (one frame and its first three PCA components are shown on the left side of each example). The first three components of this basis are used to determine the maximum likelihood Genie object rotation for each frame (shown on the right side).
  • Figure 4: MCC-HO results for MOW test examples. The input image (top, left), network-inferred hand-object point cloud (top, right), RAR (bottom, left), and an alternative view of the point cloud (bottom, right) are shown.
  • Figure 5: Qualitative comparisons using MOW test examples. Note that the MCC-HO point clouds are rendered as meshes via Poisson surface reconstruction, which can introduce artifacts that are not attributable to our method. The last column is MCC-HO + RAR.
  • ...and 1 more figures
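
The PCA step described in the Figure 3 caption can be sketched as follows. This is a rough illustration, not the authors' code: random arrays stand in for per-patch DINOv2 features, the array shapes and function names are hypothetical, and only the basis computation and projection are shown (how the projected components are used to score candidate rotations is not covered here).

```python
import numpy as np

def masked_pca_basis(features, masks, n_components=3):
    """Fit a PCA basis on features inside the object masks, pooled over frames.

    features: (F, H, W, D) per-patch feature vectors (stand-in for DINOv2)
    masks:    (F, H, W)    boolean object silhouettes
    Returns the feature mean (D,) and the basis (n_components, D).
    """
    selected = features[masks]                      # (P, D) masked features
    mean = selected.mean(axis=0)
    _, _, vt = np.linalg.svd(selected - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(features, mean, basis):
    """Project features of shape (..., D) onto the PCA basis."""
    return (features - mean) @ basis.T

# Toy usage: 4 frames of 8x8 patches with 16-dimensional random features.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8, 16))
masks = np.zeros((4, 8, 8), dtype=bool)
masks[:, 2:6, 2:6] = True                           # square stand-in silhouette
mean, basis = masked_pca_basis(features, masks)
components = project(features, mean, basis)         # (4, 8, 8, 3)
```

Fitting the basis on all frames at once, rather than per frame, keeps the three components comparable across the video, which is what allows them to be visualized consistently from frame to frame.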