Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation

Haodong Yan, Hang Yu, Zhide Zhong, Weilin Yuan, Xin Gong, Zehang Luo, Chengxi Heyu, Junfeng Li, Wenxuan Song, Shunbo Zhou, Haoang Li

TL;DR

This work tackles open-world hand-object interaction video generation by introducing a structure- and contact-aware representation composed of contact-augmented hand-object contours and depth maps, trained without 3D annotations. A joint-generation paradigm with a hierarchical joint denoiser (shared semantics and specialized details) enables simultaneous synthesis of HOI representations and videos, mitigating multi-stage error accumulation. The approach is validated on Taste-Rob and Taco, outperforming state-of-the-art methods in physics realism and temporal coherence and showing strong generalization to unseen objects. The authors also demonstrate the scalability and effectiveness of their representation through large-scale curation (>100k HOI videos) and thorough ablations. Overall, the method advances HOI video generation by uniting scalable structure cues with contact semantics under a unified diffusion-based framework, enabling robust open-world performance.

Abstract

Generating realistic hand-object interaction (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods use an HOI representation as an auxiliary generative objective to guide video synthesis. However, they face a dilemma: neither 2D nor 3D representations can simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that jointly generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design. Our project page is https://hgzn258.github.io/SCAR/.

Paper Structure

This paper contains 13 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of our structure and contact-aware representation and its enabled open-world generalization. (a) Prior HOI representations lack either scalability (e.g., 3D mesh) or crucial contact/structure cues (e.g., optical flow or segmentation). Our approach resolves this dilemma with a representation composed of two scalable and complementary components: 1) contact-augmented hand-object contours for capturing the contact region and hand-object spatial localization, and 2) depth maps offering holistic structure context. (b) Our structure and contact-aware representation acts as an additional interaction-oriented generative supervision signal. By learning to jointly generate videos and our representation at a large scale, our model captures interaction patterns consistent with physical constraints, enabling strong generalization to complex open-world interactions, even with unseen non-rigid objects.
  • Figure 2: Overview of our structure and contact-aware representation curation pipeline. It begins with (a) Segmentation Extraction, where a CoT-guided VLM grounds hand and object from the input RGB video, and SAM2 generates HOI masks. Next, (b) Contact Region Estimation produces the final contact-augmented hand-object contours by computing a contact region from the intersection of the dilated hand and object contours. In parallel, (c) Video Depth Estimation generates a dense depth map sequence for holistic structure. Finally, these contact-augmented hand-object contours are alpha-blended onto the depth maps to form the final HOI representation.
  • Figure 3: The joint-generation paradigm of our method. Given an observed image and a task description, our framework jointly generates a video and its corresponding HOI representation. The core technical novelty lies in the Hierarchical Joint Denoiser that co-denoises visual and interaction tokens within a unified latent space. First, the Shared Semantics module enforces cross-modal consistency via an alignment loss (maximizing cosine similarity) to capture shared semantics like spatial layout and temporal dynamics. Then, the Specialized Details module adds a learnable interaction embedding to capture modality-specific details. Finally, the denoised predicted visual and interaction tokens are passed through the VAE decoder to reconstruct both outputs.
  • Figure 4: Qualitative comparison with state-of-the-art methods on the Taco [liu2024taco] dataset. CogVideoX [yang2024cogvideox] produces distorted hands and implausible contact. Wan2.1 [wan2025wan] fails to generate the semantically correct action described in the task description. The two-stage FLOVD [jin2025flovd] suffers from error propagation, where inaccurate initial optical flow results in hallucination (a red object suddenly appearing). In contrast, our SCAR generates physics-realistic, temporally coherent videos by jointly generating our proposed HOI representation. Please refer to the supplementary video for better illustration. Experimental results on the Taste-Rob [zhao2025taste] dataset are also available in the supplementary material.
  • Figure 5: Qualitative comparison of generated HOI representations corresponding to Figure 4. The optical flow generated by FLOVD [jin2025flovd] is noisy and inaccurate, which leads to the error propagation seen in the final video. In contrast, our jointly generated representation embodies consistent structural and contact cues, indicating that the model captures physical interaction patterns.
  • ...and 1 more figures
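
The Figure 2 caption describes the contact region as the intersection of the dilated hand and object contours, with the contact-augmented contours then alpha-blended onto the depth map. The following is a minimal NumPy sketch of that idea for a single frame, not the authors' implementation. It assumes binary hand and object masks (e.g., from SAM2) and a depth map normalized to [0, 1] are already available. The function names, the square structuring element, the dilation radius, the blend weight, and the overlay colours are all illustrative assumptions.

```python
import numpy as np

# Illustrative overlay colours (assumptions, not taken from the paper).
HAND_RGB = np.array([0.0, 1.0, 0.0])
OBJECT_RGB = np.array([0.0, 0.0, 1.0])
CONTACT_RGB = np.array([1.0, 0.0, 0.0])


def dilate(mask, radius=1):
    """Binary dilation with a (2r+1)x(2r+1) square element, built from shifted ORs."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def contour(mask):
    """Mask pixels that have at least one background pixel in their 3x3 neighbourhood.

    Pixels outside the image are treated as foreground, so a mask touching the
    image border does not get a contour along that border.
    """
    m = mask.astype(bool)
    eroded = ~dilate(~m, 1)  # erosion as the complement of dilating the background
    return m & ~eroded


def contact_region(hand_mask, object_mask, radius=1):
    """Intersection of the dilated hand contour and the dilated object contour."""
    return dilate(contour(hand_mask), radius) & dilate(contour(object_mask), radius)


def blend(image, region, rgb, alpha):
    """Alpha-blend a flat colour onto the pixels selected by a boolean region."""
    out = image.copy()
    out[region] = (1.0 - alpha) * out[region] + alpha * rgb
    return out


def hoi_representation(depth, hand_mask, object_mask, radius=1, alpha=0.6):
    """Overlay hand contour, object contour, and contact region on a grayscale depth map."""
    base = np.repeat(depth[..., None], 3, axis=2).astype(float)
    base = blend(base, contour(hand_mask), HAND_RGB, alpha)
    base = blend(base, contour(object_mask), OBJECT_RGB, alpha)
    return blend(base, contact_region(hand_mask, object_mask, radius), CONTACT_RGB, alpha)
```

In this sketch the dilation radius acts as a pixel tolerance: contours that come within roughly twice the radius of each other produce a non-empty contact region, while well-separated contours produce none. Applying `hoi_representation` to each frame of a video would give a per-frame sequence. A production pipeline would more likely use an image-processing library's morphology routines than the hand-written `dilate` above.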