Text To 3D Object Generation For Scalable Room Assembly
Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano
TL;DR
The paper tackles the shortage of labeled 3D indoor data for perception tasks. It proposes an end-to-end pipeline that combines automated prompt engineering, text-to-image diffusion, CAT3D multi-view diffusion, NeRF-based reconstruction, and NeRFMeshing to generate 3D assets on demand and place them into floor plans. Key contributions include (i) scalable prompt generation for diverse assets, (ii) targeted improvements to diffusion, segmentation, and 3D supervision losses, and (iii) an end-to-end system that places generated objects into artist-defined rooms. The approach yields high geometric fidelity and cross-view consistency, providing scalable synthetic data for robust indoor-perception models and faster asset generation than prior mesh-based methods.
Abstract
Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end synthetic data generation system that produces scalable, high-quality, and customizable 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, this system generates high-fidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of currently available data, which is generally crafted manually by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.
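
The stage ordering described in the abstract can be illustrated with a minimal sketch. This is only a structural outline under stated assumptions: `Asset`, `make_stage`, `STAGES`, and `run_pipeline` are hypothetical names, and each stage is a placeholder that records its label instead of calling a diffusion model, a NeRF, or a renderer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Asset:
    """Carries a text prompt through the pipeline and records which stages ran."""
    prompt: str
    history: List[str] = field(default_factory=list)


def make_stage(name: str) -> Callable[[Asset], Asset]:
    """Build a placeholder stage that only logs its name (no model is invoked)."""
    def stage(asset: Asset) -> Asset:
        asset.history.append(name)
        return asset
    return stage


# Stage order follows the abstract; the labels are illustrative assumptions.
STAGES = [
    make_stage("text_to_image"),
    make_stage("multi_view_diffusion"),
    make_stage("nerf_reconstruction"),
    make_stage("mesh_extraction"),
    make_stage("floor_plan_placement"),
]


def run_pipeline(prompt: str) -> Asset:
    """Pass one prompt through every placeholder stage in order."""
    asset = Asset(prompt=prompt)
    for stage in STAGES:
        asset = stage(asset)
    return asset
```

Calling `run_pipeline("a wooden chair")` returns an `Asset` whose `history` lists the five stage labels in order. In a real system each placeholder would be swapped for the corresponding model or tool.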
