SCENIC: Scene-aware Semantic Navigation with Instruction-guided Control
Xiaohan Zhang, Sebastian Starke, Vladimir Guzov, Zhensong Zhang, Eduardo Pérez Pellitero, Gerard Pons-Moll
TL;DR
SCENIC addresses the challenge of synthesizing natural, long-horizon human motion that adapts to complex 3D terrains while following textual instructions. It introduces a diffusion-based framework with hierarchical scene reasoning, comprising goal-centric canonicalization and a local ego-centric distance field to separately handle high-level navigation and fine-grained geometry. The method integrates per-frame text alignment, autoregressive diffusion, and an object-interaction pathway, supported by scene-aware guidance during inference to ensure physical plausibility. Across four real-world datasets, SCENIC achieves state-of-the-art constraint satisfaction and motion realism, with user studies showing strong subjective preference, highlighting its practical potential for gaming, embodied AI, and virtual humans.
Abstract
Synthesizing natural human motion that adapts to complex environments while allowing creative control remains a fundamental challenge in motion synthesis. Existing models often fall short, either by assuming flat terrain or by lacking the ability to control motion semantics through text. To address these limitations, we introduce SCENIC, a diffusion model designed to generate human motion that adapts to dynamic terrains within virtual scenes while enabling semantic control through natural language. The key technical challenge lies in reasoning about complex scene geometry while maintaining text control, which requires understanding both high-level navigation goals and fine-grained environmental constraints. The model must ensure physical plausibility and precise navigation across varied terrain, while also preserving user-specified text control, such as ``carefully stepping over obstacles'' or ``walking upstairs like a zombie.'' Our solution introduces a hierarchical scene reasoning approach. At its core is a novel scene-dependent, goal-centric canonicalization that handles the high-level goal constraint and is complemented by an ego-centric distance field that captures local geometric details. This dual representation enables our model to generate physically plausible motion across diverse 3D scenes. By implementing frame-wise text alignment, our system achieves seamless transitions between different motion styles while maintaining scene constraints. Experiments demonstrate that our novel diffusion model generates arbitrarily long human motions that both adapt to complex scenes with varying terrain surfaces and respond to textual prompts. Additionally, we show that SCENIC can generalize to four real-scene datasets. Our code, dataset, and models will be released at \url{https://virtualhumans.mpi-inf.mpg.de/scenic/}.
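To make the dual scene representation more concrete, the following is a minimal illustrative sketch, not the SCENIC implementation. It assumes a ground-plane goal frame whose +x axis points from the goal toward the start position, and it substitutes a simple vertical-clearance grid for the ego-centric distance field. The names `to_goal_frame`, `ego_clearance_grid`, and `terrain_height` are hypothetical and do not come from the paper or its code release.

```python
import math


def to_goal_frame(point, goal, start):
    """Express a ground-plane point (x, z) in a goal-centric frame.

    Assumption: the origin is the goal and the +x axis points from the
    goal toward the start position. The paper's canonicalization is also
    scene-dependent, which this sketch does not model.
    """
    dx, dz = start[0] - goal[0], start[1] - goal[1]
    norm = math.hypot(dx, dz)
    if norm == 0.0:
        raise ValueError("start and goal coincide; the frame axis is undefined")
    cos_a, sin_a = dx / norm, dz / norm
    px, pz = point[0] - goal[0], point[1] - goal[1]
    # Rotate so that the goal-to-start direction maps onto +x.
    return (cos_a * px + sin_a * pz, -sin_a * px + cos_a * pz)


def ego_clearance_grid(root, heading, terrain_height, half_extent=1.0, resolution=5):
    """Sample vertical clearance on a heading-aligned grid around the root.

    Simplified stand-in for an ego-centric distance field: each cell stores
    root height minus terrain height at that cell, not a true nearest-surface
    distance. `root` is (x, y, z) with y up, `heading` is in radians, and
    `terrain_height(x, z)` is a hypothetical heightmap callable.
    """
    if resolution < 2:
        raise ValueError("resolution must be at least 2")
    rx, ry, rz = root
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    step = 2.0 * half_extent / (resolution - 1)
    grid = []
    for i in range(resolution):
        row = []
        for j in range(resolution):
            u = -half_extent + i * step  # offset along the heading
            v = -half_extent + j * step  # offset to the side
            wx = rx + cos_h * u - sin_h * v
            wz = rz + sin_h * u + cos_h * v
            row.append(ry - terrain_height(wx, wz))
        grid.append(row)
    return grid


if __name__ == "__main__":
    # A single 0.5 m step located at x >= 0.5 (toy terrain for illustration).
    step_terrain = lambda x, z: 0.5 if x >= 0.5 else 0.0
    print(to_goal_frame((3.0, 4.0), goal=(0.0, 0.0), start=(3.0, 4.0)))
    print(ego_clearance_grid((0.0, 1.0, 0.0), 0.0, step_terrain, resolution=3))
```

In a setup like this, the goal-frame coordinates would carry the coarse navigation signal, while the local grid would expose nearby geometry such as the step ahead. How SCENIC actually encodes and conditions on these quantities is described in the paper itself.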
