SCENIC: Scene-aware Semantic Navigation with Instruction-guided Control

Xiaohan Zhang, Sebastian Starke, Vladimir Guzov, Zhensong Zhang, Eduardo Pérez Pellitero, Gerard Pons-Moll

TL;DR

SCENIC addresses the challenge of synthesizing natural, long-horizon human motion that adapts to complex 3D terrains while following textual instructions. It introduces a diffusion-based framework with hierarchical scene reasoning, comprising goal-centric canonicalization and a local ego-centric distance field to separately handle high-level navigation and fine-grained geometry. The method integrates per-frame text alignment, autoregressive diffusion, and an object-interaction pathway, supported by scene-aware guidance during inference to ensure physical plausibility. Across four real-world datasets, SCENIC achieves state-of-the-art constraint satisfaction and motion realism, with user studies showing strong subjective preference, highlighting its practical potential for gaming, embodied AI, and virtual humans.

Abstract

Synthesizing natural human motion that adapts to complex environments while allowing creative control remains a fundamental challenge in motion synthesis. Existing models often fall short, either by assuming flat terrain or by lacking the ability to control motion semantics through text. To address these limitations, we introduce SCENIC, a diffusion model designed to generate human motion that adapts to dynamic terrains within virtual scenes while enabling semantic control through natural language. The key technical challenge lies in reasoning about complex scene geometry while simultaneously maintaining text control. This requires understanding both high-level navigation goals and fine-grained environmental constraints. The model must ensure physical plausibility and precise navigation across varied terrain, while also preserving user-specified text control, such as "carefully stepping over obstacles" or "walking upstairs like a zombie." Our solution introduces a hierarchical scene reasoning approach. At its core is a novel scene-dependent, goal-centric canonicalization that handles high-level goal constraints, complemented by an ego-centric distance field that captures local geometric details. This dual representation enables our model to generate physically plausible motion across diverse 3D scenes. By implementing frame-wise text alignment, our system achieves seamless transitions between different motion styles while maintaining scene constraints. Experiments demonstrate that our diffusion model generates arbitrarily long human motions that both adapt to complex scenes with varying terrain surfaces and respond to textual prompts. Additionally, we show that SCENIC generalizes to four real-scene datasets. Our code, dataset, and models will be released at https://virtualhumans.mpi-inf.mpg.de/scenic/.
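
The two levels of scene reasoning named in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names `canonicalize_to_goal` and `ego_distance_field` are hypothetical, the goal frame is taken to have its origin at the goal with +x pointing from the character's root towards it, the distance field is unsigned and sampled on a small cubic grid, and the scene is represented as a plain list of points.

```python
import math


def canonicalize_to_goal(points, root_xy, goal_xy):
    """Express (x, y, z) points, z up, in a frame centred on the goal.

    Assumption for illustration (not taken from the paper): the origin is the
    goal on the ground plane and +x points from the root towards the goal.
    """
    heading = math.atan2(goal_xy[1] - root_xy[1], goal_xy[0] - root_xy[0])
    cos_h, sin_h = math.cos(-heading), math.sin(-heading)
    canonical = []
    for x, y, z in points:
        tx, ty = x - goal_xy[0], y - goal_xy[1]  # translate: goal becomes origin
        # rotate about the vertical axis so the root-to-goal direction is +x
        canonical.append((cos_h * tx - sin_h * ty, sin_h * tx + cos_h * ty, z))
    return canonical


def ego_distance_field(scene_points, root, heading, half_extent=1.0, resolution=3):
    """Unsigned distance from each cell of a body-aligned grid to the scene.

    The grid is a cube of side 2 * half_extent centred on the root and rotated
    about the vertical axis by `heading`. Requires resolution >= 2 and a
    non-empty scene. Uses a brute-force nearest-point search and returns a
    flat list ordered x-major, then y, then z.
    """
    step = 2.0 * half_extent / (resolution - 1)
    offsets = [-half_extent + i * step for i in range(resolution)]
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    field = []
    for ox in offsets:
        for oy in offsets:
            for oz in offsets:
                cell = (root[0] + cos_h * ox - sin_h * oy,
                        root[1] + sin_h * ox + cos_h * oy,
                        root[2] + oz)
                field.append(min(math.dist(cell, p) for p in scene_points))
    return field
```

In this sketch the canonicalized coordinates carry the coarse "where is the goal" information, while the grid of distances describes only the geometry immediately around the body, which mirrors the split between high-level goal constraints and local geometric detail described above. A practical system would likely replace the brute-force search with a precomputed signed distance volume or a spatial index.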

Paper Structure

This paper contains 28 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 2: Architecture overview. SCENIC takes a 3D scene, a user-defined trajectory, text prompts, and the past human motion as inputs. The past human motion and the scene encoding first undergo goal-centric canonicalization. The diffusion-based transformer then encodes the aligned text-motion tokens, scene tokens, and a timestamp token to predict the canonicalized future human motion.
  • Figure 3: Qualitative comparison with baselines. The top two rows show results on the test set of the SCENIC dataset. Without hierarchical reasoning about the scene, the baseline methods produce more leg penetration (first row) and floating (second row). Furthermore, our method generalizes to the real-world scene datasets HPS and Matterport3D (bottom two rows).
  • Figure 4: Ablation on the human-centric scene embedding. The embedding is important for preventing unwanted interactions with cluttered environments.
  • Figure 5: SCENIC generalizes to novel scenes and text instructions, as demonstrated with Replica and HPS scenarios. The model follows instructions like "take a walk", "sit on the sofa", and "run up the stairs", and adapts to more complex commands such as "jump over a stool" while adjusting to scene constraints. In the HPS scene, the model transitions between different gait styles, following the text control while adapting to the staircases.
  • Figure 6: The layout of our perceptual study for evaluating the perceived realism, compliance with scene constraints, and text-based controllability of SCENIC.
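
The captions for Figures 2 and 5 describe predicting future motion from past motion and switching gait styles as the text changes. A minimal sketch of the bookkeeping this implies is given below, assuming prompts are specified as segments with a start frame and that generation proceeds in fixed-size windows conditioned on a short context of preceding frames. The helper names `frame_prompts` and `rollout_windows`, the window and context sizes, and the segment format are all hypothetical and not taken from the paper.

```python
def frame_prompts(segments, num_frames):
    """Expand (start_frame, prompt) segments into one prompt per frame.

    `segments` must be non-empty and sorted by start_frame. Frames before the
    first start reuse the first prompt.
    """
    prompts = []
    idx = 0
    for frame in range(num_frames):
        # advance to the latest segment that has started by this frame
        while idx + 1 < len(segments) and segments[idx + 1][0] <= frame:
            idx += 1
        prompts.append(segments[idx][1])
    return prompts


def rollout_windows(num_frames, window, context):
    """Yield ((past_start, past_end), (future_start, future_end)) index pairs.

    Assumes the first `context` frames are a given seed motion. Each step
    conditions on the last `context` frames and generates up to `window` new
    frames, so the sequence can be extended for as long as needed.
    """
    start = context
    while start < num_frames:
        end = min(start + window, num_frames)
        yield (start - context, start), (start, end)
        start = end
```

For example, `frame_prompts([(0, "walk"), (3, "run")], 5)` labels the first three frames "walk" and the last two "run", and each window returned by `rollout_windows` would be paired with the matching slice of per-frame prompts before being passed to the generator.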