Revisit Human-Scene Interaction via Space Occupancy

Xinpeng Liu, Haowen Hou, Yanchao Yang, Yong-Lu Li, Cewu Lu

TL;DR

This work addresses the data scarcity of static Human-Scene Interaction (HSI) generation by proposing that interacting with a scene is essentially interacting with its space occupancy. It introduces the Motion Occupancy Base (MOB), a large paired human-occupancy dataset created by converting motion-only data into occupancy representations, and trains a versatile auto-regressive Human-Occupancy Interaction (HOI) controller with a field regulation module. The controller generalizes across complex occupancy layouts and to both static and dynamic scenarios without ground-truth (GT) 3D scenes for training. The approach yields stable, collision-avoiding motions in MOB and in realistic rooms, extends to object and dynamic-scene interactions, and outperforms prior methods on several metrics while emphasizing data efficiency and flexibility. The work advances scalable HOI/HSI motion generation with implications for robotics, animation, and interior design, and notes potential risks of misuse of realistic motion synthesis. Overall, the space-occupancy view and MOB-based training reduce reliance on expensive scene captures.

Abstract

Human-Scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks. However, one of the major obstacles is its limited data scale. High-quality data with simultaneously captured humans and 3D environments is hard to acquire, resulting in limited data diversity and complexity. In this work, we argue that interacting with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective, leading us to a unified novel view of Human-Occupancy Interaction. By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database: Motion Occupancy Base (MOB). Thus, the need for costly paired motion-scene datasets with high-quality scene scans can be substantially alleviated. With this new unified view of Human-Occupancy Interaction, a single motion controller is proposed to reach the target state given the surrounding occupancy. Once trained on MOB, whose complex occupancy layouts impose stringent constraints on human movement, the controller can handle cramped scenes and generalizes well to scenes of limited complexity, such as regular living rooms. With no GT 3D scenes for training, our method can generate realistic and stable HSI motions in diverse scenarios, including both static and dynamic scenes. The project is available at https://foruck.github.io/occu-page/.
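The idea of reading a motion-only sequence as a record of "invisible" occupancy can be illustrated with a toy sketch. The code below is not the MOB construction pipeline. It assumes a simplified rule in which any voxel whose center comes within a hypothetical `body_radius` of a joint is treated as free space, and every other voxel in the grid is a candidate for being occupied. The function names, the joint-based body proxy, and all parameters are assumptions made for illustration.

```python
import math
from itertools import product


def voxel_center(idx, origin, size):
    """Center of the voxel with integer index idx in an axis-aligned grid."""
    return tuple(o + (i + 0.5) * size for i, o in zip(idx, origin))


def motion_to_occupancy(frames, origin, dims, size, body_radius):
    """Derive a candidate occupancy set from a motion-only sequence.

    frames: list of frames, each a list of (x, y, z) joint positions.
    Returns the set of voxel indices never swept by the body proxy.
    Assumption: a voxel counts as swept when its center lies within
    body_radius of any joint in any frame.
    """
    occupied = set()
    for idx in product(*(range(d) for d in dims)):
        center = voxel_center(idx, origin, size)
        swept = any(
            math.dist(center, joint) <= body_radius
            for joints in frames
            for joint in joints
        )
        if not swept:
            occupied.add(idx)
    return occupied


# Toy example: one joint moving along x through the first two voxels
# of a 4 x 1 x 1 grid with 0.5 m voxels.
frames = [[(0.25, 0.25, 0.25)], [(0.75, 0.25, 0.25)]]
occupancy = motion_to_occupancy(
    frames, origin=(0.0, 0.0, 0.0), dims=(4, 1, 1), size=0.5, body_radius=0.1
)
```

In this toy case the two voxels the joint passes through stay free and the remaining two are marked as candidate occupancy. A real pipeline would presumably use the full body surface rather than joint centers and would need some rule for which unswept voxels to keep.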

Paper Structure (30 sections, 9 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: We propose that, for static HSI, interacting with a scene is essentially interacting with its space occupancy. Under this view, we can aggregate motion-only data into a unified human-occupancy knowledge base and train a versatile Human-Occupancy Interaction controller on it, achieving stable generation under various scenarios.
  • Figure 2: The construction process of Motion Occupancy Base. Note that the ceilings of the occupancy are hidden for clarity.
  • Figure 3: The architecture of our versatile motion controller. Given the motion state and history $x^t, h^t$ and the control signals $c_o^t, c_t^t$, the controller auto-regressively generates the next motion frame w.r.t. the target pose and the canonical occupancy.
  • Figure 4: Occupied voxel centers $\{P_i\}$ are located at the top left of the figure. The initial velocity vector, $\dot{p}_j$, is directed upwards, posing a risk of collision. To mitigate this, the occupancy field introduces a corrective velocity component $\Delta \dot{p}_j$. This redirects the final velocity $\dot{p}_j^{out}$ to avoid collision.
  • Figure 5: Our controller can naturally reach the target (white to blue) given the complex occupancy. Furthermore, it can stabilize the motions (red to white) around the targets after reaching them. The previous SOTA method (Zhao et al., 2023) suffers from severe penetration and fails to reach the target. For simplicity, the ceilings of the occupancy are hidden.
  • ...and 8 more figures
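
The velocity correction described in the Figure 4 caption can be sketched as a simple repulsive field. The captions do not give the paper's actual field formula, so the version below is an assumed stand-in: each occupied voxel center $P_i$ within a hypothetical `influence` radius pushes the joint away along $p_j - P_i$, scaled by how fast the joint is approaching that voxel and by a linear distance falloff. The function name, `influence`, and `gain` are illustrative assumptions.

```python
import math


def regulate_velocity(p, v, occupied_centers, influence=0.5, gain=1.0):
    """Return (delta_v, v_out) for one joint at position p with velocity v.

    Assumed rule (not the paper's formula): for every occupied voxel
    center closer than `influence` that the joint is moving toward,
    add a push directed away from that center.
    """
    delta = [0.0, 0.0, 0.0]
    for center in occupied_centers:
        offset = [pi - ci for pi, ci in zip(p, center)]  # voxel -> joint
        dist = math.sqrt(sum(o * o for o in offset))
        if dist == 0.0 or dist >= influence:
            continue
        # Speed of the joint toward the voxel (positive means approaching).
        approach = -sum(vi * oi for vi, oi in zip(v, offset)) / dist
        if approach <= 0.0:
            continue
        weight = gain * approach * (1.0 - dist / influence)
        for k in range(3):
            delta[k] += weight * offset[k] / dist
    v_out = [vi + di for vi, di in zip(v, delta)]
    return delta, v_out


# Toy example: a joint moving upward toward an occupied voxel above it.
delta_v, v_out = regulate_velocity(
    p=(0.0, 0.0, 0.0),
    v=(0.0, 1.0, 0.0),
    occupied_centers=[(0.0, 0.25, 0.0)],
)
```

Here `delta_v` plays the role of $\Delta \dot{p}_j$ and `v_out` the role of $\dot{p}_j^{out}$: the upward component of the velocity is reduced because the joint is heading toward the occupied voxel, while voxels outside the influence radius or behind the joint contribute nothing.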