DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions

Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, Bugra Tekin

TL;DR

DiffH2O introduces a diffusion-based framework to synthesize hand-object interactions from textual descriptions, addressing the scarcity of HOI data and generalization to unseen objects. It decouples HOI generation into grasping and interaction stages with a canonical hand-object representation, and uses subsequence imputing and grasp guidance to improve continuity and controllability. Detailed textual annotations for the GRAB dataset enable fine-grained prompt-driven control, and the method outperforms baselines on physics and motion metrics while generalizing to new objects. This work advances synthetic HOI data generation for applications in animation, VR, and robotics, enabling scalable, controllable HOI synthesis from language.

Abstract

Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. In this paper, we propose a novel method, dubbed DiffH2O, which can synthesize realistic, one- or two-handed object interactions from provided text prompts and the geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based manipulation stage and use separate diffusion models for each. In the grasping stage, the model only generates hand motions, whereas in the manipulation stage both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses and helps in generating realistic hand-object interactions. Third, we propose two different guidance schemes to allow more control of the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the manipulation stage. For the textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to have more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions.
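The hand-off between the two stages can be illustrated with a generic imputation loop: frames already produced by the grasping stage are written back into the sequence after every denoising step, so the second model only fills in the remaining frames. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `impute_prefix`, `sample_with_imputation`, and `toy_denoise_step` are hypothetical, the denoiser is a toy stand-in, and the known frames are inserted without re-noising for simplicity.

```python
import numpy as np


def impute_prefix(x, known, mask):
    """Overwrite the frames flagged in `mask` with the known (grasp-stage) motion."""
    # mask has shape (frames,); broadcast it over the feature dimension.
    return np.where(mask[:, None], known, x)


def sample_with_imputation(denoise_step, known, mask, num_steps, rng):
    """Toy reverse loop that re-inserts the known frames after every step.

    `denoise_step(x, t) -> x` is a hypothetical stand-in for one reverse
    diffusion step of a trained model.
    """
    x = rng.standard_normal(known.shape)  # start from pure noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
        x = impute_prefix(x, known, mask)
    return x


def toy_denoise_step(x, t):
    """Assumption: a placeholder that just shrinks the sample toward zero."""
    return 0.9 * x


rng = np.random.default_rng(0)
num_frames, feat_dim, grasp_len = 8, 3, 3

# Pretend the grasping stage produced the first `grasp_len` frames.
grasp_motion = np.zeros((num_frames, feat_dim))
grasp_motion[:grasp_len] = 1.0
mask = np.arange(num_frames) < grasp_len

motion = sample_with_imputation(toy_denoise_step, grasp_motion, mask, 10, rng)
```

In this sketch the imputed frames are reproduced exactly in the output, while the remaining frames are left to the denoiser; a real sampler would typically also match the noise level of the inserted frames to the current step.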

Paper Structure

This paper contains 56 sections, 7 equations, 5 figures, and 10 tables.

Figures (5)

  • Figure 1: Overview of DiffH2O. We couple hands and objects by representing hands relative to the object position in the initial frame and encoding hand-object distances (Sec. \ref{sec:representation}). We observe that objects are static until they have been grasped, and propose to decouple the grasping and interaction stages and model them with two different diffusion processes (Sec. \ref{sec:prepost}). Finally, we make use of grasp guidance and subsequence imputation to ensure a smooth transition between these two stages (Sec. \ref{sec:keyframe_guidance}). We further show fine-grained synthesis controllability through our detailed textual descriptions (Sec. \ref{meth:tex_aug}).
  • Figure 2: Qualitative Comparison. Post-optimizing object motion as in IMoS (bottom row) exhibits artifacts with fine-grained manipulations, e.g., when an object switches hands. In contrast, our approach (top row) seamlessly handles such cases. Best seen in the supplemental video.
  • Figure 3: Qualitative Examples. We provide more qualitative examples with a) standard generation without any guidance, b) grasp guidance, and c) our model trained with detailed text descriptions.
  • Figure 4: Failure Cases. We present three possible failure cases of our method. a) The generated motion does not match the action described in the input prompt, such as trying to perform a bottle opening motion with an apple. b) During grasp guidance, the reference grasp is largely ignored in the diffusion process, resulting in an interaction that is distinct from the grasp reference. c) Despite training with our curated text annotations, the model sometimes does not pick up on the cue of handedness and may interact with a hand different from the one provided in the text prompt.
  • Figure 5: Overview of the diffusion architecture. Our pipeline relies on a UNet block and processes three input signals: the time step $\phi(t)$, a text-prompt embedding $\mathcal{T}$ and an object shape encoding $\mathcal{M}$. The time step is encoded using sinusoidal functions, the text-prompt embedding is generated by the CLIP text encoder model and the object encoding is obtained from BPS \cite{prokudin2019efficient}. Similarly to \cite{karunratanakul2023gmd}, we use Adaptive Group normalization in 1D block
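
The sinusoidal time-step encoding mentioned in the Figure 5 caption can be sketched as follows. This is the commonly used formulation from diffusion and transformer models, given here as an illustration only; the embedding dimension, the `max_period` constant, and the function name `sinusoidal_embedding` are assumptions, since the caption does not specify them.

```python
import numpy as np


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Map scalar time steps to `dim`-dimensional sin/cos features.

    Assumption: geometric frequency spacing with `max_period`, as commonly
    used in diffusion models; the paper's exact settings are not given here.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies decay geometrically from 1 toward 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[..., None] * freqs
    # First half holds sines, second half holds cosines.
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)


emb = sinusoidal_embedding([0, 10, 500], dim=8)
```

Each time step becomes one row of `emb`; at step 0 the sine features are all zero and the cosine features are all one, which gives the network a distinct, smoothly varying signature per step.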