Style-Consistent 3D Indoor Scene Synthesis with Decoupled Objects

Yunfan Zhang, Hong Huang, Zhiwei Xiong, Zhiqi Shen, Guosheng Lin, Hao Wang, Nicholas Vun

TL;DR

This work tackles controllable 3D indoor scene synthesis by decoupling object geometry and appearance through per-object meshes and bounding-box-guided placement. It combines diffusion-based text/image generation, depth-aware inpainting, and CLIP-guided cross-attention to achieve consistent styling across multiple objects, with a cascade stylization process that conditions each object on global scene guidance and prior objects. By leveraging single-view mesh reconstruction (SyncDreamer) and language-driven placement (ChatGPT), the pipeline produces photorealistic, multi-view-consistent scenes that respect user prompts and stylistic constraints. The approach demonstrates superior style coherence and controllability on the 3D-FRONT dataset, outperforming baselines in both qualitative and quantitative evaluations and offering practical gains for AR/VR, gaming, and film workflows.

Abstract

Controllable 3D indoor scene synthesis stands at the forefront of technological progress, with applications in gaming, film, and augmented/virtual reality. The ability to stylize and decouple objects within these scenes is crucial, as it provides fine-grained control throughout the editing process. This control covers not only geometric attributes such as translation and scaling but also appearance, such as stylization. Current scene stylization methods apply a style to the entire scene and cannot separate and customize individual objects. To address this challenge, we introduce a pipeline for synthesizing 3D indoor scenes. Our approach places objects within the scene using information from professionally designed bounding boxes. Our pipeline also maintains style consistency across multiple objects within the scene, yielding a cohesive result aligned with the desired aesthetic. The pipeline generates 3D scenes that exhibit photorealism, multi-view consistency, and diversity. These scenes are generated in response to a variety of natural language prompts, demonstrating the versatility and adaptability of our model.

Paper Structure (26 sections, 5 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Synthesized stylized 3D indoor scenes. The first column shows a living room and a bedroom generated with the text prompt Chinese Style; the second column shows a living room and a bedroom generated with the text prompt Muji Style; the last column shows a Galaxy Style living room and a Starry Night Style bedroom, both generated with image prompts.
  • Figure 2: Model pipeline. The pipeline starts by sampling objects that are either specified by the user or reconstructed from a single-view image provided by the user. Next, the text prompt containing the style information is used to generate a styled reference scene image that serves as global guidance. The prompt also controls the viewpoint-dependent stylized texturization, which proceeds iteratively. In addition, each previously textured mesh supervises the texturization of the following mesh, so the whole texturization runs in a cascaded manner to achieve multi-object style consistency. The objects are then positioned and scaled within the scene based on ChatGPT's position reasoning. Finally, the scene is composed, and the result can be visualized by rendering the resulting mesh from a specified camera pose.
  • Figure 3: The reconstructed meshes from the single-view images of a wooden chair, a bed, a small cabinet and a sofa.
  • Figure 4: Diverse stylized 3D indoor scene synthesis using different prompts. The first row shows typical living room scenes and the second row shows bedrooms. The first column is conditioned on Chinese Style; its objects and camera view differ from those shown in Figure 1. The second and third columns are conditioned on Muji Style and Modern Light Luxury Style. The placements and geometries of these objects are exactly the same while the styles are entirely different, showing that our pipeline can fully decouple geometry and appearance.
  • Figure 5: Stylized meshes guided by the prompt (first column), the prompt and whole-scene images (second column), the prompt and object-level images (third column), and whole-scene images and object-level images (last column).
  • ...and 2 more figures
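
The Figure 2 caption says objects are "positioned and scaled within the scene", and the abstract mentions placement guided by bounding boxes. The summary does not give the actual transform, so the following is only a minimal sketch under stated assumptions: boxes are axis-aligned, there is no rotation, and a single uniform scale is chosen so that the mesh fits inside its box. The names `Box`, `mesh_bounds`, and `fit_to_box` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Box:
    """Axis-aligned bounding box: center point and full edge lengths (assumed format)."""
    center: Vec3
    size: Vec3


def mesh_bounds(vertices: List[Vec3]) -> Tuple[Vec3, Vec3]:
    """Return the (min, max) corners of the axis-aligned bounds of the vertices."""
    mins = tuple(min(v[i] for v in vertices) for i in range(3))
    maxs = tuple(max(v[i] for v in vertices) for i in range(3))
    return mins, maxs


def fit_to_box(vertices: List[Vec3], box: Box) -> List[Vec3]:
    """Uniformly scale and translate vertices so the mesh sits centered inside box."""
    mins, maxs = mesh_bounds(vertices)
    extent = [maxs[i] - mins[i] for i in range(3)]
    mesh_center = [(maxs[i] + mins[i]) / 2 for i in range(3)]
    # Largest uniform scale that keeps every axis inside the box.
    # Axes with zero extent (flat meshes) impose no constraint.
    ratios = [box.size[i] / extent[i] for i in range(3) if extent[i] > 0]
    scale = min(ratios) if ratios else 1.0
    # Move the mesh center to the origin, scale, then move to the box center.
    return [
        tuple((v[i] - mesh_center[i]) * scale + box.center[i] for i in range(3))
        for v in vertices
    ]


# Usage: place a unit cube into a box that is tightest along the y axis.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
placed = fit_to_box(cube, Box(center=(2.0, 0.5, 3.0), size=(1.0, 0.5, 2.0)))
```

Because the scale is uniform, the object's geometry is preserved up to size, so changing the appearance of the mesh does not require changing its placement. A per-axis scale would fill the box exactly but would distort the reconstructed mesh.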