DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang

TL;DR

DISPLAY is a framework for Human-Object Interaction video generation driven by Sparse Motion Guidance, which consists only of wrist joint coordinates and a shape-agnostic object bounding box. This guidance alleviates the imbalance between human and object representations and enables intuitive user control.

Abstract

Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce DISPLAY, a framework driven by Sparse Motion Guidance, which consists only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.
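To make the idea of Sparse Motion Guidance concrete, the following is a minimal sketch of how such a signal could be stored per frame: two wrist points and one object bounding box. Everything here is an assumption made for illustration and is not taken from the paper: the names SparseGuidanceFrame and interpolate_guidance, the normalized coordinate convention, and the linear interpolation between two user-specified keyframes.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y), assumed normalized to [0, 1]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized


@dataclass(frozen=True)
class SparseGuidanceFrame:
    """Hypothetical per-frame guidance: two wrist points and one object box."""
    left_wrist: Point
    right_wrist: Point
    object_box: Box


def _lerp(a: Tuple[float, ...], b: Tuple[float, ...], t: float) -> Tuple[float, ...]:
    """Element-wise linear interpolation between two equal-length tuples."""
    return tuple((1.0 - t) * x + t * y for x, y in zip(a, b))


def interpolate_guidance(
    start: SparseGuidanceFrame, end: SparseGuidanceFrame, num_frames: int
) -> List[SparseGuidanceFrame]:
    """Fill in a guidance track between two keyframes (illustrative assumption).

    Linear interpolation is only a stand-in for whatever trajectory a user
    would actually author; the paper does not specify this procedure.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        frames.append(
            SparseGuidanceFrame(
                left_wrist=_lerp(start.left_wrist, end.left_wrist, t),
                right_wrist=_lerp(start.right_wrist, end.right_wrist, t),
                object_box=_lerp(start.object_box, end.object_box, t),
            )
        )
    return frames
```

The point of the sketch is the size of the signal: each frame carries only eight numbers (two 2D wrist points and one box), and the box carries no information about the object's shape, in contrast to dense controls such as full-body pose maps or object masks.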

Paper Structure

This paper contains 21 sections, 3 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Human-Object Interactions Synthesis. Our method generates user-intended human-object interactions from zero-shot references and user-specified sparse motion guidance. Blue: a generated video performing human-object interactions with novel visual and object references. Gray: a generated video demonstrating environmental interactions conditioned solely on the visual reference.
  • Figure 2: The DISPLAY Framework. We organize the proposed framework into three parts: Input Pipeline, Pretrained Video Model, and Condition Branch. 1) In the Input Pipeline, the input video is processed through a predefined HOI data procedure to extract the required multi-modal conditioning signals. 2) The Pretrained Video Model preserves the original T2V denoising formulation. 3) The Condition Branch encodes the multi-modal conditions and modulates the generation process via residual injection to guide the final video synthesis.
  • Figure 3: Qualitative Comparisons. We present an object-replacement comparison on the left, where the original object is substituted according to the provided object reference. The visual reference and background condition used by our method are displayed in the top row. On the right, we present an object-insertion comparison, where the template video contains no original object.
  • Figure 4: Qualitative Comparisons. Comparisons with AnchorCraft and Re-HOLD on their official results.
  • Figure 5: Beyond Object Replacement. Guided by sparse motions authored via the proposed user interface, our model supports environmental interactions. Red arrows indicate motion patterns we provide.
  • ...and 4 more figures