ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation
Hongjie Li, Hong-Xing Yu, Jiaman Li, Jiajun Wu
TL;DR
ZeroHSI addresses the challenge of generating realistic 4D human–scene interactions (HSIs) in unseen environments without ground-truth motion data. It distills HSIs from state-of-the-art video-generation models and reconstructs 4D motion through differentiable rendering of Gaussian-based scene, object, and animatable human representations, guided by text prompts and initial poses. The approach combines per-frame optimization, camera pose refinement, and object pose tracking with a refinement stage that uses a pose prior and physics losses, achieving strong semantic alignment, motion diversity, and physical plausibility across static and dynamic scenes. This zero-shot capability enables flexible synthesis of contextually appropriate interactions in reconstructed real-world scenes, with demonstrated long-term sequences and broad compatibility with evolving video-generation models.
Abstract
Human-scene interaction (HSI) generation is crucial for applications in embodied AI, virtual reality, and robotics. Yet, existing methods cannot synthesize interactions in unseen environments such as in-the-wild scenes or reconstructed scenes, as they rely on paired 3D scenes and captured human motion data for training, which are unavailable for unseen environments. We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis, eliminating the need for training on any MoCap data. Our key insight is to distill human-scene interactions from state-of-the-art video generation models, which have been trained on vast amounts of natural human movements and interactions, and use differentiable rendering to reconstruct human-scene interactions. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects, without requiring any ground-truth motion data. We evaluate ZeroHSI on a curated dataset of various indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions.
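As a rough illustration of the idea of recovering motion from video frames through differentiable rendering, the toy sketch below fits the position of a single 1D Gaussian "splat" to each frame of a synthetic video by gradient descent on a photometric loss, warm-starting each frame from the previous estimate. It is not the ZeroHSI implementation: the 1D renderer, the names `render`, `loss_and_grad`, and `track`, and all hyperparameters are assumptions chosen only to keep the example self-contained.

```python
import math

def render(center, width=32, sigma=2.0):
    """Render a 1D Gaussian 'splat' centered at `center` onto a row of pixels."""
    return [math.exp(-((x - center) ** 2) / (2 * sigma ** 2)) for x in range(width)]

def loss_and_grad(center, target, sigma=2.0):
    """Photometric L2 loss between the rendering and `target`,
    with its analytic gradient with respect to `center`."""
    loss, grad = 0.0, 0.0
    for x, t in enumerate(target):
        r = math.exp(-((x - center) ** 2) / (2 * sigma ** 2))
        diff = r - t
        loss += diff * diff
        # d r / d center = r * (x - center) / sigma^2
        grad += 2.0 * diff * r * (x - center) / sigma ** 2
    return loss, grad

def track(frames, init_center, steps=200, lr=0.5):
    """Per-frame optimization: fit the splat position to each frame in turn,
    starting each frame from the previous frame's estimate."""
    centers = []
    c = init_center
    for target in frames:
        for _ in range(steps):
            _, g = loss_and_grad(c, target)
            c -= lr * g  # plain gradient descent step
        centers.append(c)
    return centers

# Synthetic "video": the splat moves a little between frames (toy data).
true_centers = [10.0, 11.0, 12.5, 14.0]
frames = [render(c) for c in true_centers]

# Start from a rough initial guess, analogous to a given initial pose.
estimated = track(frames, init_center=9.0)
```

In this toy setting the only unknown per frame is a scalar position; in a real system the optimized quantities would be far higher-dimensional (for example body pose, object pose, and camera pose), and the renderer would be a full Gaussian-based one rather than this hand-written 1D function.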
