GenHeld: Generating and Editing Handheld Objects

Chaerin Min, Srinath Sridhar

TL;DR

GenHeld tackles the inverse problem of generating held objects conditioned on a hand input, introducing a 3D pipeline (GenHeld3D) that learns compact object codes to retrieve plausible objects from a large dataset and fits them without altering the hand pose, and a 2D pipeline (GenHeld2D) that edits hand images by leveraging 3D guidance in diffusion-based editing. The object codes link hand pose to diverse graspable geometries, enabling fast, plausible object fitting with contact-aware and physically plausible constraints; GenHeld2D extends this by projecting 3D guidance into 2D image edits using DDIM-based inversion and gradient-guided diffusion with occlusion awareness. Experimental results show improved grasp quality, faster convergence in object fitting, and higher 2D editing plausibility compared with baselines and image-editing methods that lack 3D guidance. The work advances practical hand-object synthesis for VR, robotics, and image editing, while highlighting limitations in speed and open-vocabulary scalability and proposing future safeguards for misuse.

Abstract

Grasping is an important human activity that has long been studied in robotics, computer vision, and cognitive science. Most existing works study grasping from the perspective of synthesizing hand poses conditioned on 3D or 2D object representations. We propose GenHeld to address the inverse problem of synthesizing held objects conditioned on a 3D hand model or a 2D image. Given a 3D hand model, GenHeld 3D can select a plausible held object from a large dataset using compact object representations called object codes. The selected object is then positioned and oriented to form a plausible grasp without changing the hand pose. If only a 2D hand image is available, GenHeld 2D can edit this image to add or replace a held object. GenHeld 2D operates by combining the abilities of GenHeld 3D with diffusion-based image editing. Our experiments show that GenHeld outperforms baselines and generates high-quality, plausible held objects in both 3D and 2D.

Paper Structure (30 sections, 15 equations, 17 figures, 4 tables)

This paper contains 30 sections, 15 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: We present GenHeld, a model to synthesize held objects given a 3D hand model or a 2D hand image. GenHeld3D can select plausible and diverse objects from the large object repository Objaverse, while GenHeld2D can add held objects to images or replace existing ones.
  • Figure 2: Stable Diffusion struggles to edit images of hands holding objects. For an identity inpainting task (purple region), it is unable to faithfully reconstruct the hand or held object.
  • Figure 3: GenHeld3D can synthesize a 3D held object given a 3D hand model as input (top). We encode the 3D hand model to estimate object codes that act as a compact representation of plausible held objects. These object codes can be used to retrieve diverse objects from a much larger dataset like Objaverse. This is followed by an object fitting step that positions and orients objects to form the grasp without changing the initial hand pose.
  • Figure 4: We use object codes -- compact representations of object shapes learned from a real dataset of grasps. These codes can be used to find suitable diverse objects in a larger object dataset like Objaverse. Our method can also handle scale variations by normalizing using principal bone length $b$.
  • Figure 5: GenHeld2D enables us to add held objects to 2D hand images or replace existing ones. We do this by first lifting hand images to a 3D hand and object using GenHeld3D. This is then followed by 2D keypoint projection and alignment to create a 3D guidance image that is used to edit the image.
  • ...and 12 more figures
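
The retrieval idea in the captions of Figures 3 and 4 (predict an object code from the hand, then look up similar objects in a larger dataset) can be illustrated with a minimal nearest-neighbor sketch. This is not the authors' implementation: it assumes, purely for illustration, that an object code is a fixed-length vector and that similarity is Euclidean distance. The names `normalize_by_bone_length` and `retrieve_objects` are hypothetical, and dividing coordinates by the principal bone length $b$ is only one plausible reading of the scale normalization mentioned in Figure 4.

```python
import numpy as np


def normalize_by_bone_length(points, bone_length):
    """Express coordinates in units of the principal bone length b.

    Assumption: scale normalization is a plain division by b.
    """
    return np.asarray(points, dtype=float) / float(bone_length)


def retrieve_objects(query_code, database_codes, k=3):
    """Return indices and distances of the k database codes nearest to query_code.

    Assumption: codes are fixed-length vectors compared by Euclidean distance.
    Results are ordered from nearest to farthest.
    """
    database_codes = np.asarray(database_codes, dtype=float)
    query_code = np.asarray(query_code, dtype=float)
    dists = np.linalg.norm(database_codes - query_code, axis=1)
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]


if __name__ == "__main__":
    # Toy database of three 2D "object codes" (made-up values).
    database = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
    indices, distances = retrieve_objects([0.9, 0.0], database, k=2)
    print(indices, distances)
```

Returning several neighbors rather than only the closest one mirrors the captions' emphasis on diverse candidates; each retrieved object would still need the separate object fitting step to be positioned and oriented in the hand.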