HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization

Roey Ron, Guy Tevet, Haim Sawdayee, Amit H. Bermano

TL;DR

HOIDiNi tackles the challenge of generating realistic yet contact-accurate human-object interactions from text prompts. It introduces a joint diffusion model, CPHOI, that predicts semantically meaningful hand–object contact pairs and coordinated full-body motion, and employs a two-phase Diffusion Noise Optimization (DNO) to enforce contact constraints without leaving the learned motion manifold. Through quantitative metrics and user studies on the GRAB dataset, HOIDiNi demonstrates improvements in contact precision, physical validity, and overall motion realism, including complex actions like grasping and placing. This work advances controllable, high-fidelity HOI generation and offers a scalable diffusion-based framework that integrates object geometry, contact semantics, and motion for text-driven synthesis.

Abstract

We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging because it demands strict contact accuracy alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible by our observation that the problem can be separated into two phases: an object-centric phase, which primarily makes discrete choices of hand-object contact locations, and a human-centric phase, which refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate that HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. Project page: https://hoidini.github.io
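To make the idea of optimizing in the noise space more concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a frozen, deterministic sampler that maps latent noise to a motion, and adjusts only the noise until a contact-style loss on the output is small. The names `toy_sampler`, `contact_loss`, and `optimize_noise` are hypothetical, the sampler is a hand-written smoothing function rather than a diffusion model, and finite differences stand in for backpropagation through the sampling chain.

```python
import math

def toy_sampler(z):
    """Hypothetical stand-in for a frozen, deterministic diffusion sampler.

    Maps latent noise z (a list of floats) to a smooth 1-D trajectory.
    The sampler itself is never modified during optimization.
    """
    out, acc = [], 0.0
    for zi in z:
        acc = 0.7 * acc + 0.3 * zi   # running average keeps the output smooth
        out.append(math.tanh(acc))   # bounded output, like a plausible motion range
    return out

def contact_loss(traj, frame, target):
    """Squared distance between the trajectory at one frame and a target value."""
    return (traj[frame] - target) ** 2

def numerical_grad(f, z, eps=1e-5):
    """Central finite differences; a real system would use autograd instead."""
    grad = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        grad.append((f(zp) - f(zm)) / (2 * eps))
    return grad

def optimize_noise(z0, frame, target, steps=200, lr=0.5):
    """Plain gradient descent on the noise; the sampler stays fixed."""
    z = list(z0)
    objective = lambda v: contact_loss(toy_sampler(v), frame, target)
    for _ in range(steps):
        g = numerical_grad(objective, z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z, objective(z)

if __name__ == "__main__":
    z_opt, final_loss = optimize_noise([0.0] * 8, frame=7, target=0.5)
    print(final_loss)
```

Because only the noise changes, every candidate output is still something the sampler can produce, which is the intuition behind enforcing constraints without leaving the learned distribution.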

Paper Structure

This paper contains 29 sections, 10 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: HOIDiNi generates human-object interactions from text descriptions and object geometry, integrated here into a 3D scene from BlendSwap.
  • Figure 2: System Overview. HOIDiNi generates Human-Object Interaction (HOI) motions according to a text prompt, a mesh describing the object, and the occupied volume in the scene, by optimizing the diffusion noise. The Object-Centric Phase generates the object motion and its contact points with the hands ($CP$ and $O$), then the Human-Centric Phase follows and generates the full human motion ($H$): body and fingers, adhering to the constraints implied by the previous phase. Both phases use CPHOI, a pre-trained diffusion model that learned the human-object joint distribution. We apply Diffusion Noise Optimization (DNO) (Karunratanakul et al., 2024) to fulfill the two sets of loss functions ($\mathcal{L}_{\text{Object}}$ and $\mathcal{L}_{\text{Human}}$) without deviating from the learned distribution.
  • Figure 3: Contact Pairs. CPHOI predicts precise, semantically meaningful contact points between the hand and object. Each contact pair is visualized with matching colored spheres.
  • Figure 4: CPHOI Diffusion Model. CPHOI autoregressively predicts the next motion segment $s^n$ from the previous one $s^{n-1}$. The figure illustrates a single diffusion step, where the model denoises $s_t^n$ to predict $\hat{s}_0^n$. It jointly generates human and object motions, along with dynamic contact points, conditioned on the object’s geometry and a text description of the interaction.
  • Figure 5: Qualitative Results of human-object interactions generated by our method across diverse prompts. For instance, “taking a picture with a camera” yields a semantically appropriate two-handed pose. Motions are both visually plausible and aligned with the prompts.
  • ...and 5 more figures
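
The autoregressive scheme described in the Figure 4 caption, where each segment $s^n$ is generated conditioned on the previous segment $s^{n-1}$, can be sketched roughly as follows. This is an illustrative toy under stated assumptions: `toy_denoiser` is a hypothetical placeholder for the network that predicts the clean segment $\hat{s}_0^n$ from the noisy $s_t^n$, and the update rule is a simple interpolation toward that prediction, not an actual DDPM or DDIM schedule.

```python
import random

def toy_denoiser(s_t, t, prev_segment):
    """Hypothetical stand-in for the network: predicts a clean segment s0_hat.

    Here the 'clean' segment simply continues the previous one linearly,
    so the prediction ignores s_t and t; a real model would use both.
    """
    start = prev_segment[-1]
    step = prev_segment[-1] - prev_segment[-2]
    return [start + step * (i + 1) for i in range(len(s_t))]

def sample_segment(prev_segment, length, num_steps, rng):
    """Denoise one segment from pure noise, conditioned on the previous segment."""
    s_t = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for t in range(num_steps, 0, -1):
        s0_hat = toy_denoiser(s_t, t, prev_segment)
        alpha = 1.0 / t  # toy schedule: lands exactly on s0_hat at t == 1
        s_t = [(1 - alpha) * x + alpha * x0 for x, x0 in zip(s_t, s0_hat)]
    return s_t

def rollout(first_segment, num_segments, length=4, num_steps=10, seed=0):
    """Generate segments one after another, each conditioned on the last."""
    rng = random.Random(seed)
    segments = [list(first_segment)]
    for _ in range(num_segments):
        segments.append(sample_segment(segments[-1], length, num_steps, rng))
    return segments

if __name__ == "__main__":
    for segment in rollout([0.0, 1.0], num_segments=2):
        print(segment)
```

In this toy, the only link between segments is the conditioning on the previous segment, which mirrors the structure in the caption; the joint prediction of human motion, object motion, and contact points is not modeled here.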