MMGDreamer: Mixed-Modality Graph for Geometry-Controllable 3D Indoor Scene Generation

Zhifei Yang, Keyang Lu, Chao Zhang, Jiaxing Qi, Hanqi Jiang, Ruifei Ma, Shenglin Yin, Yifan Xu, Mingzhe Xing, Zhen Xiao, Jieyi Long, Guangyao Zhai

TL;DR

MMGDreamer tackles geometry-controllable 3D indoor scene generation by fusing text and image inputs in a Mixed-Modality Graph (MMG). A dual-branch diffusion model operates on a Latent Mixed-Modality Graph, using a Graph Encoder and two denoisers to jointly synthesize layouts and shapes. The forward diffusion process is $q(z_t \mid z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I)$, and training minimizes the latent diffusion loss $\mathcal{L}_{\mathrm{LDM}}$. The MMG is refined by a Visual Enhancement Module and a Relation Predictor, which support geometry control and coherent layouts, with CLIP-based multimodal encodings guiding the node representations. On SG-FRONT, MMGDreamer sets a new state of the art in scene fidelity (FID/KID) and object geometry metrics, with applications in VR, interior design, and embodied scene understanding.
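The forward step $q(z_t \mid z_{t-1})$ quoted above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the linear schedule, its endpoint values, the step count, and the names `linear_beta_schedule`, `forward_step`, and `diffuse` are assumptions chosen only to make the formula concrete.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; the endpoint values are illustrative."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_step(z_prev, beta_t, rng):
    """Draw z_t ~ N(sqrt(1 - beta_t) * z_prev, beta_t * I), one coordinate at a time."""
    mean_scale = math.sqrt(1.0 - beta_t)
    std = math.sqrt(beta_t)
    return [mean_scale * z + std * rng.gauss(0.0, 1.0) for z in z_prev]


def diffuse(z0, betas, rng):
    """Apply the forward step once per entry of the schedule, starting from z0."""
    z = list(z0)
    for beta_t in betas:
        z = forward_step(z, beta_t, rng)
    return z


rng = random.Random(0)               # fixed seed so the sketch is repeatable
betas = linear_beta_schedule(1000)   # assumed number of diffusion steps
z0 = [1.0, -0.5, 0.25, 2.0]          # toy latent vector, not real model data
zT = diffuse(z0, betas, rng)
```

Each step shrinks the previous latent by $\sqrt{1-\beta_t}$ and adds Gaussian noise of variance $\beta_t$, so after many steps `zT` is close to a standard normal sample regardless of `z0`.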

Abstract

Controllable 3D scene generation has extensive applications in virtual reality and interior design, where the generated scenes should exhibit high levels of realism and controllability in terms of geometry. Scene graphs provide a suitable data representation that facilitates these applications. However, current graph-based methods for scene generation are constrained to text-based inputs and exhibit insufficient adaptability to flexible user inputs, hindering the ability to precisely control object geometry. To address this issue, we propose MMGDreamer, a dual-branch diffusion model for scene generation that incorporates a novel Mixed-Modality Graph, visual enhancement module, and relation predictor. The mixed-modality graph allows object nodes to integrate textual and visual modalities, with optional relationships between nodes. It enhances adaptability to flexible user inputs and enables meticulous control over the geometry of objects in the generated scenes. The visual enhancement module enriches the visual fidelity of text-only nodes by constructing visual representations using text embeddings. Furthermore, our relation predictor leverages node representations to infer absent relationships between nodes, resulting in more coherent scene layouts. Extensive experimental results demonstrate that MMGDreamer exhibits superior control of object geometry, achieving state-of-the-art scene generation performance. Project page: https://yangzhifeio.github.io/project/MMGDreamer.
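As a rough illustration of the graph described in the abstract, the sketch below models nodes that carry text, an image, or both, together with optional relationships. Every name here (`ObjectNode`, `MixedModalityGraph`, the placeholder file names, the relation label, and the choice of ordered node pairs as edge keys) is a hypothetical choice for illustration, not the paper's data format.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectNode:
    text: Optional[str] = None    # textual modality, e.g. an object category
    image: Optional[str] = None   # visual modality, here just a placeholder path

    def __post_init__(self):
        if self.text is None and self.image is None:
            raise ValueError("a node needs at least one modality")

    @property
    def modality(self) -> str:
        if self.text is not None and self.image is not None:
            return "mixed"
        return "text" if self.text is not None else "image"


@dataclass
class MixedModalityGraph:
    nodes: List[ObjectNode]
    # Relationships are optional: any ordered pair may be absent from this dict.
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def text_only_nodes(self) -> List[int]:
        """Nodes lacking a visual input, which visual enhancement would target."""
        return [i for i, n in enumerate(self.nodes) if n.modality == "text"]

    def missing_edges(self) -> List[Tuple[int, int]]:
        """Ordered node pairs without a relationship, candidates for relation prediction."""
        pairs = permutations(range(len(self.nodes)), 2)
        return [p for p in pairs if p not in self.edges]


graph = MixedModalityGraph(
    nodes=[
        ObjectNode(text="bed"),
        ObjectNode(image="nightstand.png"),
        ObjectNode(text="wardrobe", image="wardrobe.png"),
    ],
    edges={(1, 0): "left of"},
)
```

In this toy graph, `text_only_nodes()` picks out the nodes that a visual enhancement step would need to fill in, and `missing_edges()` lists the pairs a relation predictor would have to infer.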

Paper Structure

This paper contains 49 sections, 14 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: MMGDreamer processes a Mixed-Modality Graph to generate a 3D indoor scene, where object geometry can be precisely controlled. Starting from the fifth type of input (Mixed-Modality) shown in module A as an example, the framework utilizes a vision-language model (B) to produce a Mixed-Modality Graph (C). This graph is further refined by the Generation Module (D) to create a coherent and precise 3D scene (E).
  • Figure 1: Failure case. The dashed box on the left is a top-down view rendered from the ground truth, while the result on the right is the scene generated by MMGDreamer.
  • Figure 2: Overview of MMGDreamer. Our pipeline consists of the Latent Mixed-Modality Graph, the Graph Enhancement Module, and the Dual-Branch Diffusion Model. During inference, MMGDreamer initiates with the Latent Mixed-Modality Graph, which undergoes enhancement via the Visual Enhancement Module and the Relation Predictor, resulting in the formation of a Visual-Enhanced Graph and a Mixed-Enhanced Graph. The Mixed-Enhanced Graph is then input into the Graph Encoder $E_g$ within the Dual-Branch Diffusion Model for relationship modeling, using a triplet-GCN structured module integrated with an echo mechanism. Subsequently, the Layout Branch (C.2) and the Shape Branch (C.3) use denoisers conditioned on the nodes' latent representations to generate layouts and shapes, respectively. The final output is a synthesized 3D indoor scene where the generated shapes are seamlessly integrated into the generated layouts.
  • Figure 2: More qualitative comparison on scene generation. The first row shows the input mixed-modality graph, which visualizes only the most critical edges in the scene. Red rectangles denote areas of inconsistency in the generated scenes, while green rectangles signify regions of consistent generation.
  • Figure 3: Qualitative comparison with other methods. The first column shows the input mixed-modality graph, which visualizes only the most critical edges in the scene. Red rectangles denote areas of inconsistency in the generated scenes, while green rectangles signify regions of consistent generation.
  • ...and 3 more figures