GenHeld: Generating and Editing Handheld Objects
Chaerin Min, Srinath Sridhar
TL;DR
GenHeld tackles the inverse problem of generating held objects conditioned on a hand input, introducing a 3D pipeline (GenHeld3D) that learns compact object codes to retrieve plausible objects from a large dataset and fits them without altering the hand pose, and a 2D pipeline (GenHeld2D) that edits hand images by leveraging 3D guidance in diffusion-based editing. The object codes link hand pose to diverse graspable geometries, enabling fast object fitting under contact-aware and physical-plausibility constraints; GenHeld2D extends this by projecting the 3D guidance into 2D image edits using DDIM-based inversion and gradient-guided diffusion with occlusion awareness. Experimental results show improved grasp quality, faster convergence in object fitting, and higher 2D editing plausibility compared with baselines and image-editing methods that lack 3D guidance. The work advances practical hand-object synthesis for VR, robotics, and image editing, while highlighting limitations in speed and open-vocabulary scalability and proposing future safeguards against misuse.
Abstract
Grasping is an important human activity that has long been studied in robotics, computer vision, and cognitive science. Most existing works study grasping from the perspective of synthesizing hand poses conditioned on 3D or 2D object representations. We propose GenHeld to address the inverse problem of synthesizing held objects conditioned on a 3D hand model or a 2D image. Given a 3D model of a hand, GenHeld 3D can select a plausible held object from a large dataset using compact object representations called object codes. The selected object is then positioned and oriented to form a plausible grasp without changing the hand pose. If only a 2D hand image is available, GenHeld 2D can edit this image to add or replace a held object. GenHeld 2D operates by combining the abilities of GenHeld 3D with diffusion-based image editing. Experiments show that our method outperforms baselines and generates high-quality, plausible held objects in both 2D and 3D.
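To make the idea of selecting an object through compact object codes concrete, the following is a minimal sketch of code-based retrieval, not the authors' implementation. It assumes that a code has already been predicted from the hand and that retrieval is a nearest-neighbor lookup under Euclidean distance. The function name `retrieve_object`, the toy catalog, and the 3-dimensional codes are all hypothetical.

```python
import math


def retrieve_object(query_code, object_codes):
    """Return the id of the object whose code is closest to query_code.

    Assumption: codes are equal-length lists of floats and similarity is
    measured by Euclidean distance; the actual metric may differ.
    """
    best_id, best_dist = None, math.inf
    for obj_id, code in object_codes.items():
        dist = math.dist(query_code, code)
        if dist < best_dist:
            best_id, best_dist = obj_id, dist
    return best_id


# Hypothetical toy catalog mapping object ids to made-up object codes.
catalog = {
    "mug": [0.9, 0.1, 0.0],
    "pen": [0.0, 0.2, 0.9],
    "ball": [0.5, 0.5, 0.5],
}

# Hypothetical code predicted from a hand pose (hard-coded for illustration).
predicted_code = [0.8, 0.2, 0.1]
selected = retrieve_object(predicted_code, catalog)
```

In this sketch the selected object would then be handed to a separate fitting step that positions and orients it against the fixed hand; that step is not shown here.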
