Gaze-guided Hand-Object Interaction Synthesis: Dataset and Method
Jie Tian, Ran Ji, Lingxiao Yang, Suting Ni, Yuexin Ma, Lan Xu, Jingyi Yu, Ye Shi, Jingya Wang
TL;DR
This work tackles the generation of natural hand-object interactions guided by gaze. It introduces GazeHOI, a dataset with synchronized 3D gaze, hand, and object data, and presents GHO-Diffusion, a stacked diffusion pipeline that first synthesizes object dynamics conditioned on spatial-temporal gaze features and then synthesizes hand kinematics under HOI constraints, aided by HOI-Manifold Guidance. The approach also includes a gaze-driven feature encoding and a consistency-based motion selection step. Experiments and ablations show gains over baselines on seen and unseen objects, contact and penetration realism, and gaze alignment. The results suggest that gaze-guided synthesis is practical for AR/VR and assistive technologies; the work also notes limitations (e.g., the absence of articulated objects) and future directions, such as incorporating additional modalities for richer context.
Abstract
Gaze plays a crucial role in revealing human attention and intention, particularly in hand-object interaction scenarios, where it guides and synchronizes complex tasks that require precise coordination between the brain, hand, and object. Motivated by this, we introduce a novel task: Gaze-Guided Hand-Object Interaction Synthesis, with potential applications in augmented reality, virtual reality, and assistive technologies. To support this task, we present GazeHOI, the first dataset to capture gaze, hand, and object interactions simultaneously in 3D. The task poses significant challenges due to the inherent sparsity and noise in gaze data, as well as the need for high consistency and physical plausibility in the generated hand and object motions. To tackle these issues, we propose a stacked gaze-guided hand-object interaction diffusion model, named GHO-Diffusion. The stacked design effectively reduces the complexity of motion generation. We also introduce HOI-Manifold Guidance during the sampling stage of GHO-Diffusion, enabling fine-grained control over generated motions while keeping them on the data manifold. Additionally, we propose a spatial-temporal gaze feature encoding for the diffusion condition and select diffusion results based on consistency scores between gaze-contact maps and gaze-interaction trajectories. Extensive experiments highlight the effectiveness of our method and the unique contributions of our dataset. More details are available at https://takiee.github.io/gaze-hoi/.
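
The selection step mentioned above (keeping the sampled result that is most consistent with gaze) can be illustrated with a minimal sketch. This is not the scoring used by GHO-Diffusion: the abstract does not define its consistency scores, so the sketch assumes a toy stand-in score, the negated mean distance between per-frame 3D gaze points and object positions. The function names `mean_gaze_distance` and `select_most_consistent` are hypothetical.

```python
import math


def mean_gaze_distance(gaze_points, trajectory):
    """Mean per-frame Euclidean distance between gaze points and object positions."""
    if len(gaze_points) != len(trajectory) or not trajectory:
        raise ValueError("gaze and trajectory need the same non-zero number of frames")
    total = sum(math.dist(g, p) for g, p in zip(gaze_points, trajectory))
    return total / len(trajectory)


def select_most_consistent(gaze_points, candidates):
    """Return (index, candidate) of the trajectory with the highest toy score.

    ASSUMPTION: the score is the negated mean gaze-to-object distance,
    a stand-in for the paper's consistency scores, which are not given here.
    """
    scores = [-mean_gaze_distance(gaze_points, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, candidates[best]


# Two frames of 3D gaze points and three hypothetical sampled object trajectories.
gaze = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
candidates = [
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],  # 1.0 away from the gaze points
    [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)],  # 0.1 away from the gaze points
    [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)],  # 2.0 away from the gaze points
]
best_index, best_trajectory = select_most_consistent(gaze, candidates)
print(best_index)  # prints 1
```

Sampling several candidates and ranking them afterwards keeps the generator unchanged; only the score function would need to be swapped for one built on gaze-contact maps and gaze-interaction trajectories.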
