
SPREAD: Spatial-Physical REasoning via geometry-Aware Diffusion

Minzhang Li, Kuixiang Shao, Xuebing Li, Yuyang Jiao, Yinuo Bai, Hengan Zhou, Sixian Shen, Jiayuan Gu, Jingyi Yu

Abstract

Automated 3D scene generation is pivotal for applications spanning virtual reality, digital content creation, and Embodied AI. While computer graphics prioritizes aesthetic layouts, vision and robotics demand scenes that mirror real-world complexity, which current data-driven methods struggle to achieve due to limited unstructured training data and insufficient spatial and physical modeling. We propose SPREAD, a diffusion-based framework that jointly learns spatial and physical relationships through a graph transformer, explicitly conditioning on posed scene point clouds for geometric awareness. Moreover, our model integrates differentiable guidance for collision avoidance, relational constraints, and gravity, ensuring physically coherent scenes without sacrificing relational context. Our experiments on the 3D-FRONT and ProcTHOR datasets demonstrate state-of-the-art performance on spatial-relational reasoning and physical metrics. Furthermore, SPREAD outperforms baselines in scene consistency and stability before and after physics simulation, proving its capability to generate simulation-ready environments for embodied AI agents.
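To make the graph-transformer idea from the abstract concrete, here is a minimal, hypothetical sketch of relation-biased self-attention over object tokens. The class name `RelationBiasedAttention`, the single-head design, and the scalar per-edge-type biases are illustrative assumptions, not the paper's actual architecture (which additionally conditions on posed point-cloud features).

```python
# Sketch only: relation edges injected as additive attention biases.
# All names and shapes are illustrative, not from the paper.
import torch
import torch.nn as nn

class RelationBiasedAttention(nn.Module):
    """Single-head self-attention over object tokens, biased by relation edges."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.rel_bias = nn.Embedding(num_relations, 1)  # one scalar bias per edge type
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, rel: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) object tokens; rel: (N, N) integer relation labels
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale   # (N, N) attention logits
        logits = logits + self.rel_bias(rel).squeeze(-1)  # inject graph edges as biases
        return self.out(logits.softmax(dim=-1) @ v)

# Example: 6 objects, 32-dim tokens, 4 relation types (e.g. left-of, on-top-of, ...)
tokens = torch.randn(6, 32)
relations = torch.randint(0, 4, (6, 6))
layer = RelationBiasedAttention(dim=32, num_relations=4)
print(layer(tokens, relations).shape)  # torch.Size([6, 32])
```

Biasing the attention logits, rather than masking them, lets every object still attend to every other object while the spatial and physical graphs steer which pairs interact most strongly.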


Paper Structure

This paper contains 39 sections, 10 equations, 9 figures, and 4 tables.

Figures (9)

  • Figure 1: Illustration of SPREAD, a diffusion-based framework for generating physically plausible 3D scenes with rich object interactions. (A) SPREAD synthesizes detailed object-level layouts with natural spatial and physical interactions, going beyond coarse layout arrangements. (B) SPREAD faithfully adheres to provided spatial and physical graph priors, $\mathcal{G}$. (C) SPREAD can provide simulation-ready environments for embodied AI agents.
  • Figure 2: Overview of SPREAD. We propose SPREAD, a diffusion-based framework for generating physically plausible 3D scenes, which integrates relational constraints through spatial ($\mathcal{G}_{\rho}$) and physical ($\mathcal{G}_{\kappa}$) graphs while leveraging geometric perception via Perceiver Layers. The model employs graph-attention guided diffusion to jointly optimize physical plausibility and spatial relations during generation (a minimal sketch of such guidance follows this list), producing realistic scenes with natural object interactions.
  • Figure 3: Comparative Generation and Simulation Results. Visual comparison of scene layouts produced by our method versus three baseline approaches, shown before (left) and after (right) physics simulation.
  • Figure 4: Scene & Relation Visualization. For two generated scenes, we show the final render (left), the top-down layout (middle), and the pairwise relation evaluation matrix (right). The matrix encodes every object-pair’s spatial relation: green entries denote correct relations (w.r.t. the ground truth), and red entries denote incorrect ones.
  • Figure 5: Guidance Ablation. Results showing the effect of different guidance terms. Each major row compares results before (top) and after (bottom) adding a specific guidance term. Columns show different scenes. Red circles highlight issues such as collisions, floating, or incorrect spatial relations before guidance; green circles show improvements after applying guidance, with zoom-in views for clarity.
  • ...and 4 more figures
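As a concrete illustration of the differentiable guidance described in Figures 2 and 5, below is a minimal sketch assuming each object is parameterized by an axis-aligned box (center, size) and the denoiser predicts the clean layout. The function names, the box parameterization, and the guidance scale are all illustrative assumptions, not the paper's implementation; in particular, the relational-constraint term is omitted here.

```python
# Sketch only: collision and gravity penalties used as diffusion guidance.
# Assumes a differentiable denoiser(x_t, t) -> predicted clean layout (N, >=6),
# where columns 0:3 are box centers and 3:6 are box sizes (our convention).
import torch

def collision_penalty(centers: torch.Tensor, sizes: torch.Tensor) -> torch.Tensor:
    """Differentiable sum of pairwise axis-aligned bounding-box overlap volumes."""
    lo, hi = centers - sizes / 2, centers + sizes / 2                  # (N, 3) each
    overlap = (torch.minimum(hi[:, None], hi[None, :])
               - torch.maximum(lo[:, None], lo[None, :])).clamp(min=0)
    volumes = overlap.prod(dim=-1)                                     # (N, N)
    return volumes.sum() - volumes.diagonal().sum()                    # drop self-overlap

def gravity_penalty(centers: torch.Tensor, sizes: torch.Tensor,
                    floor_y: float = 0.0) -> torch.Tensor:
    """Penalize objects whose bottom face floats above or sinks below the floor."""
    bottom = centers[:, 1] - sizes[:, 1] / 2
    return ((bottom - floor_y) ** 2).sum()

def guidance_nudge(x_t: torch.Tensor, t: int, denoiser,
                   scale: float = 0.1) -> torch.Tensor:
    """One guidance step: score the predicted layout, descend its gradient."""
    x_t = x_t.detach().requires_grad_(True)
    x0 = denoiser(x_t, t)                       # predicted clean layout
    centers, sizes = x0[:, :3], x0[:, 3:6]
    loss = collision_penalty(centers, sizes) + gravity_penalty(centers, sizes)
    grad = torch.autograd.grad(loss, x_t)[0]
    return x_t.detach() - scale * grad          # usual denoising update omitted
```

Because both penalties are differentiable in the box parameters, their gradients can be applied at every denoising step, which is what lets guidance of this kind remove collisions and floating objects without retraining the diffusion model.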