VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

Wanyue Zhang, Lin Geng Foo, Thabo Beeler, Rishabh Dabral, Christian Theobalt

TL;DR

This work tackles controllable video generation of human-object interactions (HOI) by bridging sparse user cues and dense motion signals. It introduces VHOI, a two-stage approach: an Augmentor first densifies sparse trajectories into HOI masks, and a Dense Control Model then synthesizes HOI videos conditioned on these masks. The method uses HOI-aware motion representations, including a part-aware color palette and gating mechanisms, to produce robust, instance-aware HOI dynamics and navigation-before-interaction sequences, and it reports state-of-the-art results across multiple HOI benchmarks. The approach offers practical benefits for animation workflows and synthetic data generation in robotics, and it identifies identity preservation and 3D awareness as areas for future improvement.

Abstract

Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.

Paper Structure

This paper contains 27 sections, 7 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Motion Representation. We visualize two frames of colored sparse trajectories alongside the three intermediate motion representations studied in this work. (a) The sparse trajectory representation, where different colors denote different human parts or objects. (b) HOI masks (ours): constructed by combining object masks [ren2024grounded] with part-level human segmentation [khirodkar2024sapiens], each assigned a consistent color to encode fine-grained HOI semantics. (c) Instance masks: a coarser alternative that distinguishes only human and object regions, lacking part-level detail and interaction awareness. (d) Foreground optical flow: computed via RAFT [teed2020raft] and masked to foreground regions; while the color encoding reflects motion magnitude and direction, it does not convey part-level or HOI-specific semantics.
  • Figure 2: The trajectory augmentor $\boldsymbol{\mathcal{A}}$ receives sparse trajectories and, optionally, the corresponding visibility maps as inputs. The trajectories are processed by a trajectory extractor and fused with transformer latents and visibility cues in the augmentor fuser, producing a sequence of HOI masks that densifies the sparse control signals. These masks are then used by the dense control model $\boldsymbol{\mathcal{D}}$, shown in Figure 3. Orange modules denote learnable components; blue modules are frozen.
  • Figure 3: The dense control model $\boldsymbol{\mathcal{D}}$ conditions on HOI masks. The masks are encoded by an HOI extractor and fused with transformer latents in the dense control fuser, which also includes a confidence prediction head to modulate reliance on the control signal. The final output is an HOI video that follows the densified motion cues. Orange modules denote learnable components; blue modules are frozen. (Best viewed with zoom.)
  • Figure 4: Qualitative comparisons of TORA-finetuned (TORA*), Go-with-the-Flow (Go-Flow), and our method alongside ground-truth videos. Our approach achieves higher interaction fidelity and visual quality across diverse HOI scenarios.
  • Figure 5: Qualitative ablation of different motion representations. We compare augmentors trained on foreground optical flow, instance masks, and our proposed HOI masks. Flow-based conditioning lacks interaction semantics and fails to capture the grasp in this example. Instance-mask conditioning predicts the interaction but does not preserve object identity. Our HOI mask representation provides richer interaction semantics and leads to higher-quality video generation.
  • ...and 4 more figures
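
The difference between the HOI masks and the instance masks described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: the function names, label ids, and colors below are hypothetical, and the rule that object pixels overwrite human pixels where they overlap is an assumption made only for this example. It takes a per-pixel body-part label map and a binary object mask, as a part-level segmenter and an object segmenter might provide, and paints each region with a fixed color.

```python
import numpy as np

# Hypothetical palette: part label id -> RGB color.
# The ids and colors are illustrative and not taken from the paper.
PART_PALETTE = {
    1: (255, 0, 0),   # e.g. torso
    2: (0, 255, 0),   # e.g. left hand
    3: (0, 0, 255),   # e.g. right hand
}
OBJECT_COLOR = (255, 255, 0)  # hypothetical color for the object
HUMAN_COLOR = (255, 0, 255)   # hypothetical single color for the whole human


def compose_hoi_mask(part_labels, object_mask):
    """Paint each body part and the object with its own fixed color.

    part_labels: (H, W) integer array, 0 means background.
    object_mask: (H, W) boolean array, True on object pixels.
    """
    height, width = part_labels.shape
    out = np.zeros((height, width, 3), dtype=np.uint8)
    for label, color in PART_PALETTE.items():
        out[part_labels == label] = color
    # Assumption: the object is drawn last, so it wins on overlapping pixels.
    out[object_mask] = OBJECT_COLOR
    return out


def compose_instance_mask(part_labels, object_mask):
    """Coarser variant: one color for the human, one for the object."""
    height, width = part_labels.shape
    out = np.zeros((height, width, 3), dtype=np.uint8)
    out[part_labels > 0] = HUMAN_COLOR
    out[object_mask] = OBJECT_COLOR
    return out


if __name__ == "__main__":
    parts = np.array([[0, 1], [2, 3]])
    obj = np.array([[False, False], [False, True]])
    print(compose_hoi_mask(parts, obj))
    print(compose_instance_mask(parts, obj))
```

Applying the same palette to every frame keeps the color of each part consistent over time, which is the property the caption attributes to HOI masks. The instance-mask variant collapses all body parts into a single color and so discards the part-level detail.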