M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation

Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li, Li Lin, Yuwang Wang

TL;DR

This work tackles the lack of large-scale, richly annotated 3D indoor layout data for text-driven generation by introducing M3DLayout, a 21k-layout, 433k-object dataset drawn from real scans, professional CAD designs, and procedurally generated scenes, each paired with structured textual descriptions. It develops two model families (diffusion-based and autoregressive) that learn to generate 3D layouts conditioned on text, and benchmarks them against state-of-the-art baselines, with notable gains in diversity, fidelity, and controllability, especially when leveraging the Inf3DLayout subset. A supplementary object-retrieval pipeline maps generated layouts to 3D assets for realistic rendering and evaluation. The authors acknowledge potential noise in the language-generated annotations and plan to release the dataset and code publicly to foster broader validation and extension of text-driven 3D scene synthesis.

Abstract

In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 21,367 layouts and over 433k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using both a text-conditioned diffusion model and a text-conditioned autoregressive model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset, which provides rich small-object information and enables the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis. The dataset and code will be made public upon acceptance.
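
To make the pairing of a layout with a three-part structured description concrete, here is a minimal Python sketch. It is illustrative only: the class names, field names, units, and the volume threshold that separates large furniture from small items are all assumptions, not the dataset's actual schema, and the simple templates only list categories rather than reproducing the relational or VLM-generated descriptions the dataset provides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectBox:
    """One object instance as a 3D bounding box (hypothetical schema)."""
    category: str
    center: Tuple[float, float, float]  # assumed to be in metres
    size: Tuple[float, float, float]    # full extents along each axis
    yaw: float = 0.0                    # rotation about the vertical axis, radians

    def volume(self) -> float:
        sx, sy, sz = self.size
        return sx * sy * sz


@dataclass
class Layout:
    """A single room layout (hypothetical schema)."""
    source: str      # e.g. "3D-FRONT", "Matterport3D", "Infinigen"
    room_type: str
    objects: List[ObjectBox] = field(default_factory=list)


def describe(layout: Layout, large_volume: float = 0.25) -> Dict[str, str]:
    """Fill three text templates: summary, large furniture, small items.

    The `large_volume` threshold (cubic metres) is an arbitrary assumption
    chosen for illustration.
    """
    large = [o.category for o in layout.objects if o.volume() >= large_volume]
    small = [o.category for o in layout.objects if o.volume() < large_volume]
    return {
        "summary": f"A {layout.room_type} with {len(layout.objects)} objects.",
        "large_furniture": "Large furniture: " + (", ".join(large) or "none") + ".",
        "small_items": "Small items: " + (", ".join(small) or "none") + ".",
    }


# Usage: a toy bedroom with one large and one small object.
example = Layout(
    source="Infinigen",
    room_type="bedroom",
    objects=[
        ObjectBox("bed", (0.0, 0.25, 0.0), (2.0, 0.5, 1.6)),
        ObjectBox("lamp", (1.2, 0.7, 0.0), (0.2, 0.4, 0.2)),
    ],
)
description = describe(example)
```

Splitting by bounding-box volume is only one possible way to separate furniture from small items; a category whitelist would serve equally well in a sketch like this.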

Paper Structure

This paper contains 34 sections, 4 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: The M3DLayout dataset — A multi-source benchmark for text-to-3D indoor scene generation. Top: An example from our dataset showing a detailed 3D indoor layout with richly annotated bounding boxes and its corresponding structured textual description. Bottom-left: Word cloud visualization demonstrating the diversity of room types, furniture, and objects in the dataset. Bottom-right: Overview of the large-scale collection containing 21,367 diverse 3D layout scenes with various styles.
  • Figure 2: Pipeline for Constructing the M3DLayout Dataset. Our framework integrates multi-source data, including the professional designs dataset 3D-FRONT, real-world scans from Matterport3D, and procedurally generated scenes from Infinigen. The construction process involves: meticulously generating, partitioning, and filtering layouts to create the Inf3DLayout subset; applying template-based rules to produce formatted text; and employing global and local rendering for vision-language models (VLMs) to produce structured descriptions. This pipeline results in a large-scale, richly annotated text-3D layout paired dataset.
  • Figure 3: Dataset statistics of M3DLayout. (a) Top 15 most frequent object categories. (b) Distribution of the number of objects per scene. (c) Proportion of scenes contributed by each source.
  • Figure 4: Qualitative comparison of different methods on diverse room types. From top to bottom: bedroom, dining room, and living room generation results. Each row shows the input prompt and generated layouts from DiffuScene, InstructScene, and our method. Trained on the M3DLayout dataset, our method produces richer layout details from text descriptions.
  • Figure 5: Density controllability in layout generation with different input texts. The first row presents input prompts for our layout generation model, showing variations in object density from low to high, with minor changes in the last sentence. The second row illustrates the corresponding outputs generated by our model, which adapt to the density specified in the prompt.
  • ...and 8 more figures