SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories

Alexey Gavryushin, Alexandros Delitzas, Luc Van Gool, Marc Pollefeys, Kaichun Mo, Xi Wang

TL;DR

SIGHT addresses the challenge of generating realistic 3D hand–object trajectories from a single image and a brief task description. It introduces SIGHT-Fusion, a diffusion-based motion generator conditioned on a wrist-centered visual crop and text, augmented with retrieval-based 3D mesh guidance and an inference-time interpenetration penalty to enforce plausible contacts. The model is trained with a velocity loss and a DDPM-style reconstruction loss. Evaluated on HOI4D and H2O, it shows improved diversity, realism, and physical plausibility over adapted baselines, and ablations confirm the value of both multi-modal conditioning and geometry-based inference-time guidance. This work advances image-to-3D hand–object trajectory synthesis with potential applications in robotics, animation, and action understanding, and it provides code and models to facilitate further research.
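
To make the training objective mentioned above more concrete, the following is a minimal sketch of how a reconstruction term and a velocity term could be combined. It is not the authors' implementation: the clean-sample (x0) prediction target, the (frames, features) array layout, the function names, and the weight `lambda_vel` are all assumptions made for illustration.

```python
import numpy as np


def reconstruction_loss(x0_pred, x0_true):
    """Mean squared error between predicted and ground-truth clean motion."""
    return float(np.mean((x0_pred - x0_true) ** 2))


def velocity_loss(x0_pred, x0_true):
    """MSE between frame-to-frame differences along the time axis (axis 0)."""
    v_pred = np.diff(x0_pred, axis=0)
    v_true = np.diff(x0_true, axis=0)
    return float(np.mean((v_pred - v_true) ** 2))


def training_loss(x0_pred, x0_true, lambda_vel=1.0):
    """Weighted sum of both terms; lambda_vel is a hypothetical weight."""
    return (reconstruction_loss(x0_pred, x0_true)
            + lambda_vel * velocity_loss(x0_pred, x0_true))


# Toy motion: 4 frames, 3 features per frame (assumed layout).
x0_true = np.zeros((4, 3))
x0_shifted = x0_true + 1.0          # constant offset: wrong pose, correct velocity
print(training_loss(x0_shifted, x0_true))
```

In this toy case the constant offset is penalized only by the reconstruction term, since its frame-to-frame differences match the ground truth. The velocity term instead penalizes jitter between consecutive frames.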

Abstract

When humans grasp an object, they naturally form trajectories in their minds to manipulate it for specific tasks. Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems in learning to operate effectively within the physical world. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image and a brief language-based task description. Prior work on hand-object trajectory generation typically relies on textual input that lacks explicit grounding to the target object, or assumes access to 3D object meshes, which are often considerably more difficult to obtain than 2D images. We propose SIGHT-Fusion, a novel diffusion-based image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database and enforcing geometric hand-object interaction constraints via a novel inference-time diffusion guidance. We benchmark our model on the HOI4D and H2O datasets, adapting relevant baselines for this novel task. Experiments demonstrate our superior performance in the diversity and quality of generated trajectories, as well as in hand-object interaction geometry metrics.
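
The retrieval step described above can be illustrated with a nearest-neighbor lookup over feature vectors. This is a sketch under stated assumptions rather than the paper's method: the use of cosine similarity, the feature dimensionality, and the names `retrieve_mesh`, `db_features`, and `mesh_ids` are hypothetical, and the toy features are hand-written rather than produced by any encoder.

```python
import numpy as np


def retrieve_mesh(query, db_features, mesh_ids):
    """Return the id and cosine similarity of the database entry closest to query."""
    q = query / np.linalg.norm(query)
    db = db_features / np.linalg.norm(db_features, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every database entry
    best = int(np.argmax(sims))
    return mesh_ids[best], float(sims[best])


# Toy database: one 2-D feature vector per stored mesh (assumed values).
db_features = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]])
mesh_ids = ["bottle", "book", "milk"]

mesh_id, score = retrieve_mesh(np.array([0.9, 0.1]), db_features, mesh_ids)
print(mesh_id, round(score, 3))
```

Normalizing both sides makes the dot product a cosine similarity, so the lookup depends on feature direction and not on magnitude.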

Paper Structure

This paper contains 18 sections, 3 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: The proposed SIGHT task and SIGHT-Fusion method. Given an input image showing an object being interacted with by a single hand or two hands as well as a task description, the SIGHT task is to generate realistic and physically plausible hand-object motion sequences mapping out possible trajectories of the hand(s) when interacting with the object as shown. We propose a diffusion-based image-text conditioned method SIGHT-Fusion, which integrates a novel retrieval mechanism and inference-time guidance strategy to generate realistic and physically plausible 3D hand-object interaction trajectories.
  • Figure 2: An Overview of SIGHT-Fusion. Given an input image, we first crop it to a region centered around the hand and the interacted object by detecting the wrist position. We then extract both textual and visual features from the input. These features are passed to a diffusion-based motion generator, which synthesizes realistic and task-appropriate 3D hand-object interaction trajectories. Additionally, we use the visual feature along with the task description to retrieve a corresponding 3D object, which serves as interpenetration guidance during inference.
  • Figure 3: Baseline Comparisons. We provide qualitative comparisons of our method's synthesized motions (top row) to those of ReMoDiffuse (middle row) and MDM (bottom row), on objects from H2O (left, right) and HOI4D (center). The baselines exhibit interpenetration artifacts in the generated sequences, whereas our method produces realistic trajectories.
  • Figure S4: Further examples of hand-object interaction trajectories generated by our method, as viewed from different perspectives. Video files with further visualizations are also available as part of the Supp. Mat.
  • Figure S5: Conditioning images used to generate the trajectories in Figure S4 (top to bottom). The associated action labels are "read book", "pour bottle", and "open milk".
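
The interpenetration guidance mentioned in the Figure 2 caption can be sketched as a penalty on hand points that fall inside the object, with its gradient used to nudge the sample during denoising. The example below is a deliberately simplified illustration and not the authors' formulation: it assumes the retrieved object is approximated by a sphere rather than a mesh, and the squared-depth penalty, the function names, and the step size `scale` are all hypothetical.

```python
import numpy as np


def penetration_penalty(points, center, radius):
    """Sum of squared penetration depths of points inside a sphere, plus gradient."""
    offsets = points - center
    dists = np.linalg.norm(offsets, axis=-1)
    depth = np.clip(radius - dists, 0.0, None)   # zero for points outside the sphere
    penalty = float(np.sum(depth ** 2))
    # d/dp (r - |p - c|)^2 = -2 (r - |p - c|) (p - c) / |p - c|
    safe = np.maximum(dists, 1e-8)               # avoid division by zero at the center
    grad = -2.0 * depth[..., None] * offsets / safe[..., None]
    return penalty, grad


def guided_update(points, center, radius, scale=0.25):
    """One gradient step that pushes penetrating points toward the surface."""
    _, grad = penetration_penalty(points, center, radius)
    return points - scale * grad


# Two hand points: the first lies inside a unit sphere, the second outside.
points = np.array([[0.5, 0.0, 0.0],
                   [2.0, 0.0, 0.0]])
center = np.zeros(3)

before, _ = penetration_penalty(points, center, 1.0)
after, _ = penetration_penalty(guided_update(points, center, 1.0), center, 1.0)
print(before, after)
```

In a diffusion sampler, a step like `guided_update` would typically be applied to the predicted hand positions at each denoising iteration. Points outside the object receive a zero gradient and are left unchanged.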