VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification
Wanyue Zhang, Lin Geng Foo, Thabo Beeler, Rishabh Dabral, Christian Theobalt
TL;DR
This work tackles controllable video generation of human-object interactions (HOI) by bridging sparse user cues and dense motion signals. It introduces VHOI, a two-stage approach: an Augmentor first densifies sparse trajectories into HOI masks, and a Dense Control Model then synthesizes HOI videos conditioned on these masks. The method uses HOI-aware motion representations, including a part-aware color palette and gating mechanisms, to generate instance-aware HOI dynamics and navigation-before-interaction sequences, and it reports state-of-the-art results across multiple HOI benchmarks. The approach offers practical benefits for animation workflows and synthetic data generation in robotics, and it leaves identity preservation and 3D awareness as areas for future improvement.
Abstract
Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depth maps, or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.
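
To make the idea of a color-encoded, part-aware mask concrete, the following is a minimal sketch of how integer segmentation labels could be rendered as RGB conditioning frames. It is an illustration only, not the authors' implementation: the label ids, the part names, the colors in `PALETTE`, and the helper names `colorize_frame` and `colorize_sequence` are all assumptions, since the abstract does not specify the actual palette or data layout.

```python
# Hypothetical label ids and colors (assumed for illustration; the paper's
# actual palette and body-part split are not specified in the abstract).
PALETTE = {
    0: (0, 0, 0),        # background
    1: (255, 255, 255),  # manipulated object
    2: (255, 0, 0),      # torso
    3: (0, 255, 0),      # left arm
    4: (0, 0, 255),      # right arm
}


def colorize_frame(label_map):
    """Map a 2D grid of integer labels to a 2D grid of RGB tuples."""
    return [[PALETTE[label] for label in row] for row in label_map]


def colorize_sequence(label_maps):
    """Colorize every frame of a mask sequence (a list of 2D label grids)."""
    return [colorize_frame(frame) for frame in label_maps]


# Two toy 3x3 frames: the right arm (label 4) moves down onto the object (label 1).
frames = [
    [[0, 2, 4],
     [3, 2, 0],
     [0, 1, 0]],
    [[0, 2, 0],
     [3, 2, 4],
     [0, 1, 0]],
]

colored = colorize_sequence(frames)
```

Because each body part and the object receive distinct colors, a downstream model conditioned on `colored` could in principle tell which region is moving, rather than seeing a single undifferentiated foreground mask.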
