HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception
Wei Yao, Yunlian Sun, Hongwen Zhang, Yebin Liu, Jinhui Tang
TL;DR
HOSIG synthesizes high-fidelity full-body human–object–scene interactions in complex indoor environments by decoupling the task into three stages: scene-aware grasp pose generation, obstacle-aware navigation, and scene-guided controllable motion generation. The framework combines a cVAE-based scene-aware grasp pose generator (SGAP) trained with a physics-informed scene distance loss, a 2D obstacle-aware map for efficient heuristic navigation, and a ControlNet-inspired diffusion model (SCoMoGen) conditioned on spatial anchors and gradient-guided constraints to reach finger-level accuracy. On the TRUMANS dataset, HOSIG reports more accurate object locomotion, fewer penetrations, and more reliable hand–object contact than prior methods. It supports unlimited motion length via autoregressive generation and needs minimal manual intervention. By unifying scene perception, navigation, and manipulation, the work targets embodied interaction synthesis for VR, robotics, and animation.
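To make the idea of heuristic navigation on a 2D obstacle-aware map concrete, the following is a minimal sketch that runs A* search over a small occupancy grid. A* is a stand-in chosen for illustration; the summary above does not say which heuristic search HOSIG uses. The function name `plan_path`, the 4-connected move set, and the 0/1 grid encoding are all assumptions.

```python
import heapq


def plan_path(grid, start, goal):
    """A* over a 2D occupancy grid (assumed encoding: 0 = free, 1 = obstacle).

    Moves are 4-connected with unit cost. Returns a list of (row, col)
    cells from start to goal, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    best_cost = {start: 0}

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cost > best_cost[cell]:
            continue  # stale heap entry; a cheaper route was already found
        if cell == goal:
            # Walk the parent links back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(
                        open_heap, (new_cost + heuristic(nxt), new_cost, nxt)
                    )
    return None


# Toy floor map: a wall with a single gap on the right-hand side.
floor_map = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = plan_path(floor_map, (0, 0), (2, 0))
```

Projecting the scene to a 2D grid keeps the search space small, which is presumably why a floor-map representation suits long-range planning; the resulting cell sequence could then serve as waypoints for a motion generator.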
Abstract
Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: (1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, (2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and (3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the gap between scene-aware navigation and dexterous object manipulation, advancing embodied interaction synthesis. Code will be released after publication. Project page: http://yw0208.github.io/hosig
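The claim of unlimited motion length through autoregressive generation can be illustrated with a generic windowed rollout, in which each new window is conditioned on the last few frames of the motion produced so far. This is a minimal sketch of that general pattern, not the paper's implementation: `rollout`, `toy_window`, the `overlap` parameter, and the one-number-per-frame pose are hypothetical stand-ins, and a real system would call a trained motion model where `toy_window` is used.

```python
def rollout(generate_window, first_seed, num_windows, overlap):
    """Chain fixed-size windows into one motion of arbitrary length.

    generate_window(seed) is assumed to return only the NEW frames that
    follow the given seed frames. Each frame is a list of numbers.
    """
    if not 1 <= overlap <= len(first_seed):
        raise ValueError("overlap must be between 1 and len(first_seed)")
    motion = list(first_seed)
    for _ in range(num_windows):
        seed = motion[-overlap:]               # condition on the latest frames
        motion.extend(generate_window(seed))   # append the newly generated frames
    return motion


def toy_window(seed, new_frames=4):
    """Stand-in for a learned generator: advance one unit per frame."""
    last = seed[-1][0]
    return [[last + i] for i in range(1, new_frames + 1)]


# Three windows of four frames each, seeded with two starting frames.
motion = rollout(toy_window, [[0], [1]], num_windows=3, overlap=2)
```

Because every window only needs the tail of the previous one, memory use stays constant per step and the total length is bounded only by how many windows are requested.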
