Text To 3D Object Generation For Scalable Room Assembly

Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano

TL;DR

The paper tackles the shortage of labeled 3D indoor data for perception tasks. It proposes an end-to-end pipeline that combines automated prompt engineering, text-to-image diffusion, CAT3D multi-view diffusion, NeRF-based reconstruction, and NeRFMeshing to generate 3D assets on demand and integrate them into floor plans. Key contributions include (i) scalable prompt generation for diverse assets, (ii) targeted improvements to diffusion, segmentation, and 3D supervision losses, and (iii) an end-to-end system capable of placing generated objects into artist-defined rooms. The approach yields high geometric fidelity and cross-view consistency, providing scalable synthetic data for robust indoor-perception models, with faster asset generation than prior mesh-based methods.

Abstract

Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for generating scalable, high-quality, and customizable synthetic 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, this system generates high-fidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of currently available data, which is generally hand-crafted by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.

Paper Structure

This paper contains 25 sections, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the proposed system from scalable prompt generation (1), through text-to-image diffusion (2), multi-view latent diffusion (3), and NeRF (4), onto the resulting meshes (5) integrated in existing rooms (6).
  • Figure 2: Normal estimation and its use in the NeRF, expanding on Steps 3 and 4 from Figure 1.
  • Figure 3: Qualitative evaluation of the contributions of our system to the different subcomponents. (a) Impact of object-based adaptations, including segmentation, fine-tuning of the LDM, and density regularization. (b) Effect of the NeRF and NeRFMeshing adaptations for normal regularization: normal smoothness, orientation loss, and normal supervision.
  • Figure 4: Synthetic 3D room variations generated by the system. From the original room (top left), annotated objects are replaced by generated semantic equivalents to produce room permutations.
  • Figure 5: Samples showing the effect of contextual precision on the generated outputs.
  • ...and 1 more figure
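
The room-variation idea described in the Figure 4 caption (annotated objects replaced by generated semantic equivalents) can be illustrated with a minimal sketch. This is not the authors' implementation: the slot dictionary, the `label` and `asset` keys, the asset identifiers, and the function `room_permutations` are all hypothetical, and the exhaustive enumeration is an assumption made only for illustration.

```python
from itertools import product


def room_permutations(slots, generated):
    """Yield one dict per room variant, mapping slot name -> asset id.

    slots:     hypothetical annotation of the original room,
               slot name -> {"label": semantic class, "asset": original asset id}
    generated: hypothetical mapping, semantic class -> list of generated asset ids
    """
    names = list(slots)
    # A slot whose class has no generated equivalent keeps its original asset.
    choices = [
        generated.get(slots[name]["label"]) or [slots[name]["asset"]]
        for name in names
    ]
    # Enumerate every combination of one asset per slot (an assumption;
    # a real system might sample instead of enumerating).
    for combo in product(*choices):
        yield dict(zip(names, combo))


slots = {
    "slot_a": {"label": "chair", "asset": "orig_chair"},
    "slot_b": {"label": "lamp", "asset": "orig_lamp"},
    "slot_c": {"label": "rug", "asset": "orig_rug"},
}
generated = {
    "chair": ["chair_gen_0", "chair_gen_1"],
    "lamp": ["lamp_gen_0"],
}
variants = list(room_permutations(slots, generated))
```

With two generated chairs, one generated lamp, and no generated rug, this toy input yields two variants, each keeping the original rug. Matching by semantic class keeps each replacement a semantic equivalent of the object it replaces; placement, scale, and rendering are outside the scope of the sketch.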