EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion

Guangyao Zhai, Evin Pınar Örnek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, Benjamin Busam

TL;DR

EchoScene addresses unconditional and controllable 3D indoor scene generation from dynamic scene graphs by introducing a dual-branch diffusion framework (layout and shape) coupled with an information echo scheme. Each graph node maintains its own denoising process, while an exchange unit aggregates global graph information via graph convolutions at every denoising step, ensuring adherence to the scene graph constraints. The layout branch models bounding boxes with per-node diffusion and layout echoes, while the shape branch uses a VQ-VAE latent for per-object shapes with shape echoes; both branches are trained jointly with losses $\mathcal{L}_{layout}$ and $\mathcal{L}_{shape}$. Experiments on SG-FRONT demonstrate higher generation fidelity and stronger graph-constraint robustness than prior work, with qualitative evidence of improved inter-object coherence and compatibility with texture generation pipelines like SceneTex. The approach offers a scalable, controllable path toward editing 3D indoor scenes via graph manipulation while enabling downstream photorealistic texturing workflows.
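To make "editing via graph manipulation" concrete, here is a minimal sketch of a scene graph that supports node addition and relation changes. The `SceneGraph` class, its method names, and the relation strings are hypothetical illustrations and are not taken from the EchoScene code base.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): nodes are object categories,
    edges are (subject index, relation, object index) triplets."""
    nodes: List[str] = field(default_factory=list)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, category: str) -> int:
        """Append an object node and return its index."""
        self.nodes.append(category)
        return len(self.nodes) - 1

    def add_edge(self, subj: int, relation: str, obj: int) -> None:
        """Add a directed relation between two existing nodes."""
        for idx in (subj, obj):
            if not 0 <= idx < len(self.nodes):
                raise IndexError(f"node index {idx} does not exist")
        self.edges.append((subj, relation, obj))

    def change_relation(self, subj: int, obj: int, new_relation: str) -> int:
        """Rewrite the relation on every edge subj -> obj; return how many changed."""
        changed = 0
        for i, (s, _, o) in enumerate(self.edges):
            if s == subj and o == obj:
                self.edges[i] = (s, new_relation, o)
                changed += 1
        return changed


# Usage: build a small graph, then edit it as a manipulator might.
graph = SceneGraph()
bed = graph.add_node("bed")
stand = graph.add_node("nightstand")
graph.add_edge(stand, "left of", bed)          # illustrative relation label
graph.change_relation(stand, bed, "right of")  # relation-change edit
lamp = graph.add_node("lamp")                  # node-addition edit
graph.add_edge(lamp, "standing on", stand)
```

Because the number of nodes and edges changes with each edit, a generator conditioned on such a graph cannot assume a fixed input size, which is the difficulty the per-node denoising design targets.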

Abstract

We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes from scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enabling collaborative information exchange among these processes, which makes generation controllable, consistent, and aware of global constraints. This is achieved through an information echo scheme in both the shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models are open-sourced.
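The control flow of the information echo scheme can be sketched as follows. This is a toy illustration under strong assumptions, not the authors' implementation: the random matrices `w_self` and `w_nbr` stand in for trained networks, the mean-aggregation message passing stands in for the graph convolution of the exchange unit, and the update rule is a simple interpolation rather than a real diffusion sampler. All function names are hypothetical.

```python
import numpy as np


def exchange(states, edges, w_self, w_nbr):
    """Stand-in for the information exchange unit: one round of
    mean-aggregation message passing over the scene graph edges."""
    num_nodes = states.shape[0]
    agg = np.zeros_like(states)
    deg = np.zeros(num_nodes)
    for s, o in edges:
        # Treat edges as undirected for this toy example.
        agg[s] += states[o]
        agg[o] += states[s]
        deg[s] += 1
        deg[o] += 1
    agg /= np.maximum(deg, 1)[:, None]  # isolated nodes keep a zero message
    return np.tanh(states @ w_self + agg @ w_nbr)


def echo_denoise(num_nodes, edges, dim=8, steps=10, seed=0):
    """Run one toy denoising process per node, synchronised at every step."""
    rng = np.random.default_rng(seed)
    # Assumption: random weights replace the trained denoiser parameters.
    w_self = rng.normal(scale=0.3, size=(dim, dim))
    w_nbr = rng.normal(scale=0.3, size=(dim, dim))
    x = rng.normal(size=(num_nodes, dim))  # one noisy latent per graph node
    for _ in range(steps):
        # "Echo": every process shares its state and receives a
        # graph-aware signal before taking its own step.
        echo = exchange(x, edges, w_self, w_nbr)
        x = x + (echo - x) / steps  # toy update, NOT a real DDPM/DDIM step
    return x


# The same code handles graphs of any size, since nothing depends on num_nodes.
small = echo_denoise(num_nodes=3, edges=[(0, 1), (1, 2)])
large = echo_denoise(num_nodes=6, edges=[(0, 1), (1, 2), (3, 4)])
print(small.shape, large.shape)
```

The point of the sketch is structural: weights are shared across nodes and the only cross-node coupling is the exchange step, so adding or removing nodes and edges changes the loop's inputs but not the model.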

Paper Structure (22 sections, 6 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: EchoScene Schematic. EchoScene uses a dual-branch diffusion model to generate 3D scenes from scene graphs. In both branches, each node is allocated a denoising process, and different processes are aware of global states through layout and shape "echoes" (waves in different colors) with an information exchange unit (grey block) along the denoising steps.
  • Figure 2: Overview of EchoScene. Our pipeline consists of graph preprocessing and two collaborative branches, the Layout Branch and the Shape Branch. The details of one step of the two branches are shown in Fig. 3 (One Step of Dual-Branch Information Echo). During inference, EchoScene evolves the contextual graph to the latent space, where a manipulator optionally adjusts the graph by editing nodes and edges. Then, EchoScene samples random noise from the Gaussian distributions $\mathcal{B}$ and $\mathcal{S}$ for the two branches, conditioned on the latent graph, to generate shapes and layouts. Finally, the generated shapes are populated into the layouts to synthesize the scenes. An external texture generator, SceneTex [chen2023scenetex], can optionally be used to provide a more photorealistic appearance.
  • Figure 3: One Step of Dual-Branch Information Echo. For each time step, we encourage the layout (left) and shape (right) branches to exchange information within each branch for all objects in the same scene.
  • Figure 3: Inter-object Consistency. The consistent object shapes within a scene are indicated by low CD values ($\times 0.001$).
  • Figure 4: Comparisons with other generative methods. Input scene graphs have more edges between two nodes than the ones visualized here. Red rectangles highlight the inconsistent generation. (Zoom for details)
  • ...and 6 more figures