SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Anindita Ghosh, Vladislav Golyanik, Taku Komura, Philipp Slusallek, Christian Theobalt, Rishabh Dabral

TL;DR

SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.

Abstract

Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent ("walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework showing that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues. It relies on (1) a text-conditioned autoregressive global motion planner whose scene representation is a bird's-eye-view (BEV) image rendered from an elevated corner of the scene and encoded with DINOv2 features, and (2) a geometry-grounded motion tokenizer, trained via a conditional VQ-VAE, that uses a 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization strikes an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.
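
To make the "2D local scene heightmap" idea concrete, the following is a minimal sketch of cropping a fixed-size height patch around a character's root position. It is not the authors' implementation: the function name `crop_local_heightmap`, the grid layout (rows along y, columns along x, world origin at cell (0, 0)), the cell size, the patch extent, and the out-of-bounds fill value are all assumptions chosen for illustration.

```python
import numpy as np

def crop_local_heightmap(scene_height, root_xy, cell_size=0.05, half_extent=16, fill=0.0):
    """Crop a square height patch centred on the root position.

    scene_height: (H, W) array of surface heights, indexed [y_cell, x_cell]
                  (hypothetical layout; world origin assumed at cell (0, 0)).
    root_xy:      (x, y) root position in metres on the ground plane.
    Returns a (2*half_extent+1, 2*half_extent+1) patch; cells that fall
    outside the scene grid are set to `fill`.
    """
    H, W = scene_height.shape
    cx = int(round(root_xy[0] / cell_size))
    cy = int(round(root_xy[1] / cell_size))
    size = 2 * half_extent + 1
    patch = np.full((size, size), fill, dtype=scene_height.dtype)

    # Overlap between the requested window and the scene grid.
    y0, y1 = max(cy - half_extent, 0), min(cy + half_extent + 1, H)
    x0, x1 = max(cx - half_extent, 0), min(cx + half_extent + 1, W)
    if y0 < y1 and x0 < x1:
        py0 = y0 - (cy - half_extent)  # row offset of the overlap inside the patch
        px0 = x0 - (cx - half_extent)  # column offset of the overlap inside the patch
        patch[py0:py0 + (y1 - y0), px0:px0 + (x1 - x0)] = scene_height[y0:y1, x0:x1]
    return patch

# Toy scene: a 2 m x 2 m floor with a 0.45 m high block (e.g. a couch seat).
scene = np.zeros((40, 40))
scene[10:20, 10:20] = 0.45
patch = crop_local_heightmap(scene, root_xy=(0.75, 0.75), half_extent=4)
```

A patch like this is far smaller than a point cloud or voxel grid of the same region, which is the kind of saving the abstract attributes to 2D cues; how SceMoS actually samples, orients, and normalizes its heightmaps is not specified in this excerpt.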

Paper Structure (27 sections, 7 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: The introduced scene-aware 3D human motion synthesis framework, SceMoS, uses 2D scene cues and text instructions to generate physically consistent and realistic 3D motions. We use a bird's-eye-view (BEV) image rendered from an elevated corner of the input scene, and extract DINOv2 features for high-level semantic planning. For fine-grained contact reasoning, we use the local 2D heightmap of the scene around the root of the person's initial pose.
  • Figure 2: Overview of the SceMoS framework. SceMoS disentangles text-conditioned scene-aware human motion synthesis into two stages. (a) The global motion planner predicts discrete motion tokens from text input and DINOv2 scene features extracted from a BEV image. (b) Our geometry-grounded motion tokenizer learns a scene-aware motion vocabulary for mapping these discrete tokens to continuous 3D human motion. We use 2D local heightmaps around poses to condition our interaction decoder (top right) for fine-grained interaction generation. The red dotted line marks components used only during training. Blue arrows trace the inference pipeline.
  • Figure 3: Visualization of long-range motion synthesis in a cluttered indoor environment. SceMoS performs geometry-grounded planning by recalculating heightmaps every $t$ frames, enabling globally coherent yet locally feasible motion planning that respects scene geometry. The BEV image input is shown in the inset.
  • Figure 4: Qualitative comparison of SceMoS with recent HSI models. SceMoS generates motions that are semantically aligned with the input text instructions while maintaining stable contact and smooth transitions. In contrast, the baselines show penetrations and misalignment (red circle) in some frames.
  • Figure A.1: Interface of our user study, where we ask participants to rank the motion clips based on 'realism' and 'semantics' on a 5-point Likert scale.
  • ...and 1 more figure
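
The caption of Figure 2 describes a tokenizer that maps between discrete tokens and continuous motion. As a rough illustration of the discrete-vocabulary step only, here is a sketch of plain nearest-neighbour vector quantization, the lookup at the core of a standard VQ-VAE. It is an assumption-laden simplification: the heightmap conditioning, encoder, and decoder of the paper's conditional VQ-VAE are omitted, and the names `quantize`, `latents`, and `codebook` are hypothetical.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  (T, D) array, one latent per motion frame or window (assumed).
    codebook: (K, D) array of learned code vectors (assumed).
    Returns (tokens, quantized): (T,) integer indices and the (T, D)
    code vectors selected by those indices.
    """
    # Squared Euclidean distance between every latent and every code: (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)
    return tokens, codebook[tokens]

# Toy example with a 3-entry codebook in 2D.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]])
tokens, quantized = quantize(latents, codebook)
```

In a setup like the one the caption outlines, a planner would predict the integer `tokens`, and a decoder would turn the corresponding code vectors back into poses; in SceMoS that decoding is additionally conditioned on local heightmaps, which this sketch does not model.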