Gaze-guided Hand-Object Interaction Synthesis: Dataset and Method

Jie Tian; Ran Ji; Lingxiao Yang; Suting Ni; Yuexin Ma; Lan Xu; Jingyi Yu; Ye Shi; Jingya Wang

TL;DR

This work tackles generating natural hand-object interactions guided by gaze. It introduces GazeHOI, a dataset with synchronized 3D gaze, hand, and object data, and presents GHO-Diffusion, a stacked diffusion pipeline that first synthesizes object dynamics conditioned on spatial-temporal gaze features and then synthesizes hand kinematics under HOI constraints, aided by HOI-Manifold Guidance. The approach includes a gaze-driven feature encoding, a consistency-based motion selection, and extensive ablations showing gains over baselines on seen/unseen objects, contact/penetration realism, and gaze alignment. The results demonstrate the practicality of gaze-guided synthesis for AR/VR and assistive technologies, while also outlining limitations (e.g., absence of articulated objects) and future directions, such as incorporating additional modalities for richer context.

Abstract

Gaze plays a crucial role in revealing human attention and intention, particularly in hand-object interaction scenarios, where it guides and synchronizes complex tasks that require precise coordination between the brain, hand, and object. Motivated by this, we introduce a novel task: Gaze-Guided Hand-Object Interaction Synthesis, with potential applications in augmented reality, virtual reality, and assistive technologies. To support this task, we present GazeHOI, the first dataset to capture simultaneous 3D modeling of gaze, hand, and object interactions. This task poses significant challenges due to the inherent sparsity and noise in gaze data, as well as the need for high consistency and physical plausibility in generating hand and object motions. To tackle these issues, we propose a stacked gaze-guided hand-object interaction diffusion model, named GHO-Diffusion. The stacked design effectively reduces the complexity of motion generation. We also introduce HOI-Manifold Guidance during the sampling stage of GHO-Diffusion, enabling fine-grained control over generated motions while maintaining the data manifold. Additionally, we propose a spatial-temporal gaze feature encoding for the diffusion condition and select diffusion results based on consistency scores between gaze-contact maps and gaze-interaction trajectories. Extensive experiments highlight the effectiveness of our method and the unique contributions of our dataset. More details are available at https://takiee.github.io/gaze-hoi/.

Paper Structure (38 sections, 23 equations, 3 figures, 8 tables)

Introduction
Related Work
Gaze for Perception
Hand-Object Interaction Dataset
Hand-Object Motion Synthesis
GazeHOI Dataset
Dataset Hardware Setup
Data Collection
Data Annotation
3D Hand Pose
Object 6D Pose
Gaze Acquisition
Method
Problem Definition
GHO-Diffusion: Gaze-guided Hand-Object Diffusion
...and 23 more sections

Figures (3)

Figure 2: Automatic data processing pipeline. (a) displays 12-view images from the raw video. (b) uses MediaPipe to detect 2D hand joints and triangulation to obtain the 3D joints. (c) shows the ego view with a gaze point. (d) illustrates the objects acquired by the scanner and the marker tracking process. (e) shows the resulting hand-object motion.
Figure 3: Pipeline overview. Subfigure (a) illustrates the stacked framework of the gaze-guided hand-object interaction diffusion model, GHO-Diffusion. Subfigure (b) shows the gaze-interaction consistency score used for motion selection.
Figure 4: Qualitative comparison between baseline methods and our method.
