Reconstructing Hand-Held Objects in 3D from Images and Videos
Jane Wu, Georgios Pavlakos, Georgia Gkioxari, Jitendra Malik
TL;DR
This work tackles the reconstruction of hand-held objects in 3D from monocular RGB data by coupling a hand-grounded implicit representation with retrieval-based object augmentation. It introduces MCC-HO, a transformer-based model that jointly infers hand and object geometry as a neural implicit surface ρ(x) = (σ(x), c(x), m(x)), conditioned on a 3D hand and an RGB image. It also introduces Retrieval-Augmented Reconstruction (RAR), which uses GPT-4(V) and a text-to-3D model to retrieve a matching object and align it to the scene. The alignment is rigid and temporally consistent across frames: it searches over discretized SO(3) rotations and translations, then runs a Viterbi-based global optimization that combines Chamfer Distance and DINOv2 similarity. Experiments on DexYCB, MOW, HOI4D, and 100DOH show state-of-the-art performance and demonstrate that RAR scales to building large hand-object 3D datasets, offering a practical data engine for robotics and imitation learning. The approach highlights the value of fusing strong hand priors with modern vision-language and generative models to overcome data scarcity in hand-object 3D reconstruction.
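The Viterbi-based step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `viterbi_min_cost`, the cost tables, and the toy numbers are all hypothetical. It assumes that each frame has a fixed set of discretized pose candidates, that a per-frame cost for each candidate is already available (in the paper's setting this would presumably combine Chamfer Distance and DINOv2 similarity), and that a pairwise cost penalizes changing candidates between consecutive frames.

```python
from typing import List, Sequence


def viterbi_min_cost(
    unary: Sequence[Sequence[float]],
    pairwise: Sequence[Sequence[float]],
) -> List[int]:
    """Return the minimum-cost sequence of candidate indices, one per frame.

    unary[t][k]    -- hypothetical per-frame cost of candidate k at frame t
    pairwise[j][k] -- hypothetical cost of moving from candidate j to k
    """
    num_frames = len(unary)
    num_cands = len(unary[0])

    # best[k]: lowest total cost of any path ending in candidate k
    # at the frame currently being processed.
    best = list(unary[0])
    backptr: List[List[int]] = []

    for t in range(1, num_frames):
        new_best = []
        ptr_t = []
        for k in range(num_cands):
            # Choose the predecessor that minimizes accumulated + transition cost.
            j_star = min(range(num_cands), key=lambda j: best[j] + pairwise[j][k])
            new_best.append(best[j_star] + pairwise[j_star][k] + unary[t][k])
            ptr_t.append(j_star)
        best = new_best
        backptr.append(ptr_t)

    # Backtrack from the cheapest final candidate.
    k = min(range(num_cands), key=lambda i: best[i])
    path = [k]
    for ptr_t in reversed(backptr):
        k = ptr_t[k]
        path.append(k)
    path.reverse()
    return path


# Toy example (made-up numbers): 3 frames, 2 candidates. Frame 1 weakly
# prefers candidate 1, but the switching penalty keeps the path on candidate 0.
unary = [[0.0, 1.0], [0.6, 0.5], [0.0, 1.0]]
pairwise = [[0.0, 1.0], [1.0, 0.0]]
path = viterbi_min_cost(unary, pairwise)
```

With a zero pairwise cost the same function reduces to an independent per-frame argmin, which is one way to see what the temporal term contributes: it trades a slightly worse per-frame fit for a pose sequence that does not jump between candidates.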
Abstract
Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from Internet videos. Not only does the hand occlude much of the object, but the object is also often visible in only a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the object, and (2) the set of manipulanda is small relative to all possible objects. With these insights in mind, we present a scalable paradigm for hand-held object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Given a monocular RGB video, we aim to reconstruct hand-held object geometry in 3D over time. To obtain the best-performing single-frame model, we first present MCC-Hand-Object (MCC-HO), which jointly reconstructs hand and object geometry given a single RGB image and an inferred 3D hand as inputs. Subsequently, we prompt a text-to-3D generative model using GPT-4(V) to retrieve a 3D object model that matches the object in the image(s) and rigidly align it to the scene; we call this procedure Retrieval-Augmented Reconstruction (RAR). RAR provides unified object geometry across all frames, and the result is rigidly aligned with both the input images and the 3D MCC-HO observations in a temporally consistent manner. Experiments demonstrate that our approach achieves state-of-the-art performance on lab and Internet image/video datasets. We make our code and models available on the project website: https://janehwu.github.io/mcc-ho
