Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy

Gong Jingyu, Tong Kunkun, Chen Zhuoran, Yuan Chuanhan, Chen Mingang, Zhang Zhizhong, Tan Xin, Xie Yuan

TL;DR

SSOMotion addresses the challenge of generating human motion in 3D scenes by integrating fine-grained scene semantics with geometry through a unified Scene Semantic Occupancy (SSO). It introduces a bi-directional tri-plane decomposition to create a compact SSO, and maps semantic categories into a unified space via a CLIP-based embedding with dimensionality reduction, producing scene features that guide a diffusion-based motion model. A dedicated motion-control branch learns to align generated motion with textual instructions and scene constraints, using frame-wise scene queries and cross-attention mechanisms to achieve goal-directed behavior. Extensive experiments across ShapeNet-based cluttered scenes and real-scanned PROX/Replica environments demonstrate state-of-the-art performance in navigation, interaction, and long-term motion synthesis, with improved generalization and notable efficiency thanks to the proposed semantic tri-plane representation and reduction strategy.

Abstract

Human motion synthesis in 3D scenes relies heavily on scene comprehension, yet current methods focus mainly on scene structure and ignore semantic understanding. In this paper, we propose SSOMotion, a human motion synthesis framework that takes a unified Scene Semantic Occupancy (SSO) as its scene representation. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and a shared linear dimensionality reduction. This strategy captures fine-grained scene semantic structures while significantly reducing redundant computation. We further use these scene hints, together with the movement direction derived from instructions, for motion control via frame-wise scene queries. Extensive experiments and ablation studies on cluttered scenes built from ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate the method's cutting-edge performance and validate its effectiveness and generalization ability. Code will be publicly available at https://github.com/jingyugong/SSOMotion.
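The semantic mapping described in the abstract (a text embedding per category, followed by one shared linear reduction) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the label names, the 512-dimensional embedding size, the 8-dimensional reduced size, and the random vectors standing in for real CLIP text embeddings. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical label set; the paper's actual categories are not listed here.
labels = ["floor", "chair", "sofa", "table"]

# Random stand-ins for CLIP text embeddings (dimensions are assumptions).
clip_dim, reduced_dim = 512, 8
text_embeddings = rng.standard_normal((len(labels), clip_dim))
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)

# Shared linear layer: the same weights are applied to every category,
# so all labels land in one low-dimensional feature space.
W = rng.standard_normal((clip_dim, reduced_dim)) / np.sqrt(clip_dim)
b = np.zeros(reduced_dim)
label_features = text_embeddings @ W + b  # shape: (num_labels, reduced_dim)


def scatter_semantics(label_map, label_features):
    """Write each cell's reduced feature into a dense map; empty cells stay zero."""
    h, w = label_map.shape
    out = np.zeros((h, w, label_features.shape[1]))
    occupied = label_map >= 0
    out[occupied] = label_features[label_map[occupied]]
    return out


# A tiny 2D map of label ids, with -1 marking empty cells.
label_map = np.array([[0, 0, 1],
                      [0, 3, -1]])
semantic_map = scatter_semantics(label_map, label_features)
```

Because the reduction is shared, adding a new category only requires one more text embedding; the map channels keep the same meaning across scenes, which is one way to read the "unified" feature space.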

Paper Structure

This paper contains 54 sections, 14 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Illustration of human motion synthesis within 3D Scene Semantic Occupancy (SSO). We decompose the SSO into a bi-directional tri-plane as a unified scene representation. Human scene correlation is modeled as a control signal for instruction-aware motion synthesis in 3D scenes.
  • Figure 2: Pipeline of Scene Semantic Occupancy (SSO) perception. (a) presents the Bi-directional Tri-plane Decomposition of the SSO, where scene color, semantics and depth are perceived in body-centered coordinates. In (b), we map the semantic labels into a unified semantic space via the CLIP textual encoder and a shared linear layer. The unified low-dimensional semantic features are then scattered into the semantic map. (c) shows the normalization functions for distance and color space.
  • Figure 3: Overview of (a) the network for instruction-aware human motion synthesis in 3D scenes and (b) the motion controller based on Goal-directed Human Scene Correlation.
  • Figure 4: Visual comparison of locomotion synthesis between DIMOS and the proposed method.
  • Figure 5: Visual results given by DIMOS and the proposed method for sitting action.
  • ...and 8 more figures
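
One plausible reading of the bi-directional tri-plane decomposition in Figure 2(a) is sketched below: a labeled voxel grid is viewed along each of the three axes from both directions, and each view records the depth and semantic label of the first occupied voxel. This is an assumption made for illustration. The paper's actual formulation (body-centered coordinates, color channels, normalization) is not reproduced here, and the function names and label ids are hypothetical.

```python
import numpy as np


def directional_view(labels, axis, reverse):
    """Look along `axis` and return (depth, semantics) of the first occupied voxel.

    labels: int array of shape (X, Y, Z), with -1 marking empty voxels.
    depth is the index distance along the viewing direction, -1 where nothing is hit.
    """
    vol = np.flip(labels, axis=axis) if reverse else labels
    occupied = vol >= 0
    hit = occupied.any(axis=axis)
    first = occupied.argmax(axis=axis)  # index of the first occupied voxel
    sem = np.take_along_axis(vol, np.expand_dims(first, axis), axis=axis)
    sem = sem.squeeze(axis)
    depth = np.where(hit, first, -1)
    sem = np.where(hit, sem, -1)
    return depth, sem


def bidirectional_triplane(labels):
    """Six views: three axes, each seen from both ends."""
    return {(axis, reverse): directional_view(labels, axis, reverse)
            for axis in range(3) for reverse in (False, True)}


# Tiny example grid with two labeled voxels (label ids are hypothetical).
grid = -np.ones((4, 4, 4), dtype=int)
grid[1, 2, 3] = 5
grid[1, 2, 0] = 7
planes = bidirectional_triplane(grid)
```

In this toy grid the two voxels share a column along the last axis, so the forward view sees label 7 and the reverse view sees label 5. Keeping both directions preserves surfaces that a single projection would hide, while storing six 2D maps instead of the full 3D volume.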