HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception

Wei Yao, Yunlian Sun, Hongwen Zhang, Yebin Liu, Jinhui Tang

TL;DR

HOSIG addresses the challenge of synthesizing high-fidelity full-body human–object–scene interactions in complex indoor environments by decoupling the task into scene-aware grasp pose generation, obstacle-aware navigation, and scene-guided controllable motion generation. The framework employs a cVAE-based SGAP with a physics-informed scene distance loss, a 2D obstacle-aware map for efficient heuristic navigation, and a ControlNet-inspired diffusion model (SCoMoGen) conditioned on spatial anchors and gradient-guided constraints to achieve finger-level accuracy. On TRUMANS, HOSIG achieves superior object locomotion, reduced penetrations, and robust hand–object contact, with unlimited motion length via autoregressive generation and minimal manual intervention. By unifying scene perception, navigation, and manipulation, the work advances embodied interaction synthesis for VR, robotics, and animation with practical, scalable performance.

Abstract

Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Code will be available after publication. Project page: http://yw0208.github.io/hosig
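
To make the second component more concrete, the sketch below shows one generic way to plan an obstacle-avoiding path on a 2D floor map. It is not the paper's algorithm: it swaps in plain A* search with an obstacle-proximity penalty, and the names `plan_path`, `cost_map`, and `obstacle_weight`, as well as the encoding of blocked cells as `None`, are assumptions made for illustration. The paper's own heuristic function and its dual-component spatial reasoning are not reproduced here.

```python
import heapq


def plan_path(cost_map, start, goal, obstacle_weight=1.0):
    """A* over a 2D grid (illustrative stand-in, not the HOSIG planner).

    cost_map[r][c] is None for a blocked cell, otherwise a non-negative
    obstacle-proximity penalty (hypothetical encoding).
    Returns a list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(cost_map), len(cost_map[0])

    def heuristic(cell):
        # Manhattan distance: admissible because every move costs at least 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry, a cheaper route was found already
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue  # outside the map
            if cost_map[nr][nc] is None:
                continue  # blocked cell
            # Unit step cost plus a penalty for moving close to obstacles.
            new_cost = cost + 1.0 + obstacle_weight * cost_map[nr][nc]
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal unreachable
```

Calling `plan_path(grid, (0, 0), (0, 2))` on a small grid returns the list of traversed cells, which could then be subsampled into sparse waypoints. Raising `obstacle_weight` trades path length for clearance from obstacles.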

Paper Structure

This paper contains 28 sections, 12 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Human-Object-Scene Interaction Generation. Our proposed HOSIG can generate high-fidelity full-body human motions. HOSIG can not only generate interactions with static scenes, but also generate object manipulation motions with fine hand-object contact. Moreover, relying on iterative generation and autonomous navigation, HOSIG can generate long-term motions in complex indoor scenes.
  • Figure 2: Overview of Our Pipeline. HOSIG can iteratively generate long-term motions based on spatial information, text, and the previous motion clip. There are three parts worth noting in the pipeline: (1) SGAP generates fine grasping postures to ensure the quality of character interaction. (2) Heuristic Navigation generates sparse human root joint trajectories to constrain the subsequent generated motions to be within the traversable area. (3) SCoMoGen uses a dual-branch design to achieve spatial control and adds additional joint & scene guidance during inference to achieve high-precision control.
  • Figure 3: Visualization of Scene Awareness in SGAP. The green box is centered on the purple bottle. In SGAP, only the sparse scene point cloud inside the box is used, as shown in the right figure.
  • Figure 4: Pipeline of Heuristic Navigation. The blue ball in the original scene is the starting point, and the red ball is the end point. The obstacle-aware map is presented in the form of a heat map, and the values correspond to the axis at the bottom. In Heuristic Function, the dark blue dot represents the current node, the light blue dot represents the candidate node, and the red star represents the end point.
  • Figure 5: Visualization of Guidance in SCoMoGen. Precise control of motions is achieved through gradient-based guidance. Green arrows represent joints being attracted by green anchors (hand joints, path waypoints). Red arrows represent the repulsive force on the joint (red body joint) close to the scene.
  • ...and 3 more figures
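
The attraction and repulsion described in the Figure 5 caption can be illustrated with a toy gradient computation. This is a minimal sketch under stated assumptions, not SCoMoGen itself: the quadratic attraction term, the hinge-style repulsion term, and all names (`guidance_gradient`, `apply_guidance`, `margin`, `w_attract`, `w_repel`) are hypothetical, and the update is applied directly to a single 3D joint position rather than inside a diffusion sampling loop.

```python
import math


def guidance_gradient(joint, anchor=None, scene_points=(), margin=0.1,
                      w_attract=1.0, w_repel=1.0):
    """Gradient of a hypothetical guidance loss w.r.t. one joint position.

    loss = w_attract * ||joint - anchor||^2
         + w_repel * sum(max(0, margin - ||joint - p||)^2 for p in scene_points)
    """
    grad = [0.0, 0.0, 0.0]
    if anchor is not None:
        # Attraction: descending this gradient pulls the joint to the anchor.
        for i in range(3):
            grad[i] += 2.0 * w_attract * (joint[i] - anchor[i])
    for p in scene_points:
        diff = [joint[i] - p[i] for i in range(3)]
        dist = math.sqrt(sum(d * d for d in diff))
        if 0.0 < dist < margin:
            # Repulsion: only active when the joint is inside the margin.
            # d/dj (margin - dist)^2 = -2 * (margin - dist) * diff / dist
            scale = -2.0 * w_repel * (margin - dist) / dist
            for i in range(3):
                grad[i] += scale * diff[i]
    return grad


def apply_guidance(joint, steps=50, step_size=0.1, **kwargs):
    """Plain gradient descent on the joint position (illustration only)."""
    joint = list(joint)
    for _ in range(steps):
        grad = guidance_gradient(joint, **kwargs)
        joint = [joint[i] - step_size * grad[i] for i in range(3)]
    return joint
```

For example, `apply_guidance((0.0, 0.0, 0.0), anchor=(1.0, 0.0, 0.0))` moves the joint toward the anchor, while passing `scene_points` without an anchor pushes a joint that lies inside `margin` back out toward the margin boundary. In a diffusion model, a gradient of this kind would typically be used to nudge the predicted sample at each denoising step.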