GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

Xuan Huang; Mochu Xiang; Zhelun Shen; Jinbo Wu; Chenming Wu; Chen Zhao; Kaisiyuan Wang; Hang Zhou; Shanshan Liu; Haocheng Feng; Wei He; Jingdong Wang

TL;DR

GenHOI is a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. On unseen, in-the-wild scenes, it significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods.

Abstract

Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/
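
To make the idea of head-specific temporal offsets concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each attention head places the reference tokens at a different temporal index, and that these indices are spaced evenly over the video frames; the abstract does not specify the actual offset schedule. The function names (`head_sliding_offsets`, `rope_angles`, `min_temporal_distance`) are hypothetical and used only for illustration.

```python
import numpy as np


def head_sliding_offsets(num_heads: int, num_frames: int) -> np.ndarray:
    """One temporal index per head for the reference tokens.

    Assumption (not from the paper): offsets are spaced evenly
    over the frame range [0, num_frames - 1].
    """
    if num_heads == 1:
        # A single head sits in the middle of the clip.
        return np.array([(num_frames - 1) / 2.0])
    return np.linspace(0.0, num_frames - 1, num_heads)


def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard 1D RoPE rotation angles, shape (len(positions), dim // 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)


def min_temporal_distance(num_heads: int, num_frames: int) -> np.ndarray:
    """Per-frame distance to the nearest head-specific reference index."""
    offsets = head_sliding_offsets(num_heads, num_frames)
    frames = np.arange(num_frames)
    return np.abs(frames[:, None] - offsets[None, :]).min(axis=1)


if __name__ == "__main__":
    # With 4 heads and 13 frames, every frame is within 2 steps of some
    # head's reference index, instead of up to 12 steps from a fixed index 0.
    print(head_sliding_offsets(4, 13))
    print(min_temporal_distance(4, 13).max())
    # Temporal RoPE angles for the reference tokens, one row per head.
    print(rope_angles(head_sliding_offsets(4, 13), dim=8).shape)
```

Under this assumption, no frame is temporally far from the reference tokens in every head at once, which is one way to read the claim that their influence is distributed evenly across frames.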

Paper Structure (19 sections, 10 equations, 8 figures, 2 tables)

Introduction
Related Work
Human Body Animation
Diffusion-based Video Editing and Inpainting
Human-Object Interaction
Proposed Method
Overview
HOI Condition Unit
Temporally Balanced, Spatially Selective Attention
Head-Sliding RoPE
Spatial Attention Gate
Experimental Results
Implementation Details
Evaluation Setting
Comparison with other methods
...and 4 more sections

Figures (8)

Figure 1: Comparison with a representative method. Although all-in-one video editing models (e.g., VACE) benefit from large-scale Internet training data, they still struggle to maintain object consistency across frames. In contrast, our method achieves both strong generalization and natural, visually consistent interactions between the human and the object.
Figure 2: Overview of the proposed framework. The model integrates the HOI Condition Unit, Head-Sliding RoPE, and Spatial Attention Gate for temporally balanced and spatially selective HOI reenactment. HMG denotes the hard mask gate and SFG denotes the soft flow gate.
Figure 3: Visualization comparison between different reference object injection methods: (a) only HOI condition unit (HCU); (b) HCU + ref-in-bbox conditioning; (c) HCU + the proposed Attention.
Figure 4: Visualization of the applied hard mask gate and the resulting attention maps. Multiple attention maps across different heads are shown, highlighting the pronounced effect of the proposed mechanism. The red box indicates the interaction between queries in HOI regions and keys from both video and reference object tokens.
Figure 5: Top: Qualitative comparison with state-of-the-art methods. Bottom: Cross-reenactment results of the proposed method on in-the-wild videos, demonstrating robust and flexible object reenactment across various shapes, sizes, and categories.
...and 3 more figures
