Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
Haodong Yan, Hang Yu, Zhide Zhong, Weilin Yuan, Xin Gong, Zehang Luo, Chengxi Heyu, Junfeng Li, Wenxuan Song, Shunbo Zhou, Haoang Li
TL;DR
This work tackles open-world hand-object interaction (HOI) video generation by introducing a structure- and contact-aware representation, composed of contact-augmented hand-object contours and depth maps, that requires no 3D annotations. A joint-generation paradigm with a hierarchical joint denoiser (shared semantics and specialized details) synthesizes HOI representations and videos simultaneously, mitigating the error accumulation of multi-stage pipelines. The approach is validated on Taste-Rob and Taco, where it outperforms state-of-the-art methods in physics realism and temporal coherence and generalizes well to unseen objects. The authors also demonstrate the scalability and effectiveness of the representation through large-scale curation (>100k HOI videos) and ablation studies. Overall, the method combines scalable structure cues with contact semantics in a unified diffusion-based framework for open-world HOI video generation.
Abstract
Generating realistic hand-object interaction (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods use HOI representations as an auxiliary generative objective to guide video synthesis. However, they face a dilemma: neither 2D nor 3D representations can simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure- and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates both interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design. Our project page is https://hgzn258.github.io/SCAR/.
