GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction

Patrick Kwon, Chen Chen, Hanbyul Joo

TL;DR

GraspDiffusion tackles realistic full-body hand–object interaction generation in two steps: it first predicts a 3D full-body grasp pose conditioned on an object, then uses that pose to guide high-quality image synthesis with a scene-generation diffusion model that enforces accurate spatial relations and human identity. The method decouples body and hand priors, uses three spatial cues with attention-based conditioning, and leverages a curated pseudo-3D HOI dataset to train and evaluate the pipeline against baselines. Quantitative and qualitative results show improved image fidelity, pose plausibility, and interaction realism, with demonstrated applicability to diverse object inputs and artistic styles. Limitations include texture inconsistencies and a single-object focus, pointing to future work on multi-person scenes, text-controllable prompts, and video-style HOI synthesis.

Abstract

Recent generative models can synthesize high-quality images, but they often fail to generate humans interacting with objects using their hands. This failure arises mostly from the models' poor understanding of such interactions and the difficulty of synthesizing intricate regions of the body. In this paper, we propose GraspDiffusion, a novel generative method that creates realistic scenes of human-object interaction. Given a 3D object, GraspDiffusion constructs whole-body poses with control over the object's location relative to the human body. It achieves this by separately leveraging the generative priors for body and hand poses and optimizing them into a joint grasping pose. This pose guides the image synthesis to correctly reflect the intended interaction, creating realistic and diverse human-object interaction scenes. We demonstrate that GraspDiffusion can successfully tackle the relatively unexplored problem of generating full-body human-object interactions while outperforming previous methods. Our project page is available at https://yj7082126.github.io/graspdiffusion/
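
To make the idea of "optimizing them into a joint grasping pose" concrete, here is a deliberately tiny toy sketch. It is not the authors' method: the abstract does not specify the objective, so the quadratic loss, the weights, the step size, and every name below (`fuse_wrist`, `body_prior`, `hand_target`) are assumptions made for illustration. It treats the body prior and the hand-grasp prior as two competing suggestions for a single 3D wrist position and reconciles them by gradient descent.

```python
from typing import List, Sequence

Vec = List[float]


def joint_grad(x: Sequence[float], body_prior: Sequence[float],
               hand_target: Sequence[float], w_body: float, w_hand: float) -> Vec:
    """Gradient of w_body*||x - body_prior||^2 + w_hand*||x - hand_target||^2."""
    return [2.0 * w_body * (xi - bi) + 2.0 * w_hand * (xi - hi)
            for xi, bi, hi in zip(x, body_prior, hand_target)]


def fuse_wrist(body_prior: Sequence[float], hand_target: Sequence[float],
               w_body: float = 1.0, w_hand: float = 4.0,
               lr: float = 0.05, steps: int = 500) -> Vec:
    """Toy joint optimization (hypothetical, for illustration only).

    Starts from the wrist position suggested by the body prior and moves it
    toward the wrist position required by the hand grasp, trading the two
    off through the assumed weights w_body and w_hand.
    """
    x = list(body_prior)
    for _ in range(steps):
        g = joint_grad(x, body_prior, hand_target, w_body, w_hand)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    # Body prior puts the wrist at the origin; the grasp wants it elsewhere.
    print(fuse_wrist([0.0, 0.0, 0.0], [0.5, 0.0, 1.0]))
```

For this quadratic toy objective the minimizer is simply the weighted average of the two suggestions, so the loop is only a stand-in for the iterative optimization a real pipeline would run over full pose parameters with contact and penetration terms.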

Paper Structure

This paper contains 15 sections, 5 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Given an object mesh and its relative position, GraspDiffusion generates whole-body 3D grasping poses, which are subsequently used as guidance for creating human-object interaction scenes. As shown, GraspDiffusion can synthesize images with valid human-object interactions for various types of objects. Note that the bottom-right sample (a green bag) was created from an object image, which was converted into a 3D mesh using TripoSR [TripoSR2024], paving the way for further use cases.
  • Figure 2: Comparison between our method and previous approaches on generating HOI images. While previous methods can generate images conditioned on human pose and refine hand shapes, they are prone to erroneous object creation (top row) or faulty interaction synthesis (bottom row).
  • Figure 3: We present a two-stage pipeline to generate realistic human-object interaction images. The first stage takes a single object model and its human-centric location and synthesizes a 3D full-body grasping pose, providing scene-level context for image generation. The second stage uses the 3D grasping pose as a reference to conditionally generate high-quality images.
  • Figure 4: Full-body grasping pipeline. We separately leverage a hand-grasping model [Taheri2020GRABAD] and a body-pose diffusion model, then jointly optimize their outputs into a full-body grasping pose.
  • Figure 5: Scene generation stage. We inject three image conditions and semantic segmentation images as guidance for the generation of a high-quality HOI image. We then use the same types of renderings centered on the hand-object region to refine the hand quality.
  • ...and 8 more figures