Style-Consistent 3D Indoor Scene Synthesis with Decoupled Objects
Yunfan Zhang, Hong Huang, Zhiwei Xiong, Zhiqi Shen, Guosheng Lin, Hao Wang, Nicholas Vun
TL;DR
This work tackles controllable 3D indoor scene synthesis by decoupling object geometry and appearance through per-object meshes and bounding-box-guided placement. It combines diffusion-based text/image generation, depth-aware inpainting, and CLIP-guided cross-attention to achieve consistent styling across multiple objects, with a cascade stylization process that conditions each object on global scene guidance and prior objects. By leveraging single-view mesh reconstruction (SyncDreamer) and language-driven placement (ChatGPT), the pipeline produces photorealistic, multi-view-consistent scenes that respect user prompts and stylistic constraints. The approach demonstrates superior style coherence and controllability on the 3D-FRONT dataset, outperforming baselines in both qualitative and quantitative evaluations and offering practical gains for AR/VR, gaming, and film workflows.
Abstract
Controllable 3D indoor scene synthesis has applications in gaming, film, and augmented/virtual reality. The ability to stylize and decouple individual objects within a scene is crucial, because it gives fine-grained control throughout the editing process. This control covers both geometric attributes, such as translation and scaling, and appearance, such as stylization. Current scene stylization methods can only apply a style to the entire scene and cannot separate and customize individual objects. To address this limitation, we introduce a pipeline for synthesizing 3D indoor scenes. Our approach places objects within the scene using information from professionally designed bounding boxes. The pipeline also maintains style consistency across multiple objects, so that the result is cohesive and matches the desired aesthetic. It generates 3D scenes that are photorealistic, multi-view consistent, and diverse, in response to a variety of natural language prompts.
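As a rough illustration of the geometric control described above (translation and scaling of a decoupled object into a designed bounding box), the following minimal sketch fits a mesh's vertices into an axis-aligned target box. It is not the authors' implementation: the function name `fit_to_box`, the centering rule, and the uniform-scale option are assumptions made for illustration, and rotation is ignored.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def fit_to_box(
    vertices: Sequence[Vec3],
    box_min: Vec3,
    box_max: Vec3,
    uniform: bool = True,
) -> List[Vec3]:
    """Translate and scale vertices so they fit an axis-aligned box.

    Hypothetical helper: the centering and scaling rules are assumptions,
    not the placement procedure used in the paper.
    """
    # Extent of the source mesh along each axis.
    lo = [min(v[i] for v in vertices) for i in range(3)]
    hi = [max(v[i] for v in vertices) for i in range(3)]
    size = [hi[i] - lo[i] for i in range(3)]
    target = [box_max[i] - box_min[i] for i in range(3)]

    # Per-axis scale; axes with zero extent keep scale 1.
    scale = [target[i] / size[i] if size[i] > 0 else 1.0 for i in range(3)]
    if uniform:
        # Preserve proportions: use the smallest scale over non-flat axes.
        valid = [target[i] / size[i] for i in range(3) if size[i] > 0]
        s = min(valid) if valid else 1.0
        scale = [s, s, s]

    # Move the mesh center onto the box center.
    src_c = [(lo[i] + hi[i]) / 2 for i in range(3)]
    dst_c = [(box_min[i] + box_max[i]) / 2 for i in range(3)]
    return [
        tuple((v[i] - src_c[i]) * scale[i] + dst_c[i] for i in range(3))
        for v in vertices
    ]
```

With `uniform=True` the object keeps its proportions and may leave part of the box empty; with `uniform=False` it is stretched to fill the box exactly. Which behavior suits a given object is a design choice the abstract does not specify.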
