CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout

Haotian Bai, Yuanhuiyi Lyu, Lutao Jiang, Sijia Li, Haonan Lu, Xiaodong Lin, Lin Wang

TL;DR

CompoNeRF tackles the challenge of generating coherent multi-object 3D scenes from text by introducing an editable 3D scene layout and a modular, object-centric NeRF framework. Objects are represented by individual NeRFs with local prompts and are composited into a global scene via a dedicated composition module, with dual guidance from global and local text prompts to mitigate guidance collapse. The method supports recomposition through caching of decomposed NeRFs and layout editing, enabling flexible scene editing and rapid generation of complex scenes. Experiments show substantial gains in multi-view CLIP alignment (up to 54%) and user studies indicate improved semantic accuracy and consistency, demonstrating practical value for AR/VR content creation. Overall, CompoNeRF advances text-to-3D by enabling precise object control, repeatable recomposition, and robust scene coherence in multi-object settings.

Abstract

Text-to-3D generation plays a crucial role in creating editable 3D scenes for AR/VR. Recent advances have shown promise in merging neural radiance fields (NeRFs) with pre-trained diffusion models for text-to-3D object generation. However, one enduring challenge is their inadequate capability to accurately parse and regenerate consistent multi-object environments. Specifically, these models struggle to represent the quantity and style specified by multi-object texts, often resulting in a collapse of rendering fidelity that fails to match the semantic intricacies. Moreover, amalgamating these elements into a coherent 3D scene is a substantial challenge, stemming from the generic distribution inherent in diffusion models. To tackle the issue of 'guidance collapse' and further enhance scene consistency, we propose a novel framework, dubbed CompoNeRF, which integrates an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. It begins by interpreting a complex text into a layout populated with multiple NeRFs, each paired with a corresponding subtext prompt for precise object depiction. Next, a tailored composition module seamlessly blends these NeRFs, promoting consistency, while the dual-level text guidance reduces ambiguity and boosts accuracy. Notably, our composition design permits decomposition. This enables flexible scene editing and recomposition into new scenes based on the edited layout or text prompts. Utilizing the open-source Stable Diffusion model, CompoNeRF generates multi-object scenes with high fidelity. Remarkably, our framework achieves up to a 54% improvement on the multi-view CLIP score metric. Our user study indicates that our method significantly improves semantic accuracy, multi-view consistency, and individual recognizability for multi-object scene generation.
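To make the reported metric concrete, here is a minimal sketch of how a multi-view CLIP score and a relative improvement could be computed. It assumes the score is the mean cosine similarity between one text embedding and the embeddings of several rendered views; the abstract does not spell out the exact definition, so this is an illustration rather than the authors' evaluation code. The function names and the toy embeddings are hypothetical, and in practice the embeddings would come from a CLIP text and image encoder.

```python
import numpy as np


def multi_view_clip_score(text_emb, view_embs):
    """Mean cosine similarity between a text embedding and N view embeddings.

    Assumption: embeddings are precomputed (e.g. by a CLIP model); this
    function only does the normalisation and averaging.
    """
    text = np.asarray(text_emb, dtype=np.float64)
    views = np.asarray(view_embs, dtype=np.float64)
    text = text / np.linalg.norm(text)
    views = views / np.linalg.norm(views, axis=1, keepdims=True)
    # One cosine similarity per rendered view, then the average over views.
    return float(np.mean(views @ text))


def relative_improvement(baseline, ours):
    """Relative gain of `ours` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (ours - baseline) / baseline * 100.0


# Toy 2-D embeddings, purely illustrative (not data from the paper).
toy_text = [1.0, 0.0]
toy_views = [[1.0, 0.0], [0.0, 1.0]]
print(multi_view_clip_score(toy_text, toy_views))
```

Averaging over views rather than scoring a single render penalises scenes that match the prompt from one camera angle only, which is the kind of multi-view inconsistency the abstract highlights.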

Paper Structure (24 sections, 6 equations, 22 figures, 3 tables)

Figures (22)

  • Figure 1: The guidance collapse issue. (a) Generation of the multi-object scene involves utilizing the frozen Stable Diffusion. (b) Instances of guidance collapse are observed when using the global text directly. (c) Comparison of rendering results.
  • Figure 2: (a) CompoNeRF supports caching and loading to facilitate NeRF composition. (b) The composition module composites multiple NeRFs for coherent scenes. Its enhanced effect is accentuated by the red boxes, showcasing superior scene coherency.
  • Figure 3: Framework Overview. The CompoNeRF model unfolds in three stages: 1) Editing 3D scene, which initiates the process by structuring the scene with 3D boxes and textual prompts; 2) Scene rendering, which encapsulates the composition/recomposition process, facilitating the transformation of NeRFs to a global frame, ensuring cohesive scene construction. Here, we specify design choices between density-based or color-based (without refining density) composition; 3) Joint Optimization, which leverages textual directives to amplify the rendering quality of both global and local views, while also integrating revised text prompts and NeRFs for refined scene depiction.
  • Figure 4: Design Impact Comparison: Density vs. Color-based Methods. The top row illustrates the density-based approach's detailed rendering and quick convergence in the 'table wine' scene. The bottom row highlights the color-based method's enhancements and its drawbacks, such as geometric and shadow inaccuracies, particularly in close-up views and slow convergence.
  • Figure 5: Detail of Composition module: density-based design.
  • ...and 17 more figures
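
The layout and scene-rendering stages described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's composition module: boxes are axis-aligned (no rotation), each local NeRF is replaced by a toy constant field, and overlapping objects are merged with a simple density-weighted blend (summed densities, density-weighted colors). The names `BoxNode`, `SceneLayout`, `constant_field`, and `composite`, as well as the prompts, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np

# A local field maps (N, 3) points in a box's unit frame to
# (density of shape (N,), rgb of shape (N, 3)). A real system would
# query a trained per-object NeRF here.
Field = Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]]


@dataclass
class BoxNode:
    prompt: str            # local (object-level) text prompt
    center: np.ndarray     # box centre in the global frame, shape (3,)
    half_size: np.ndarray  # half extents along x, y, z, shape (3,)
    field: Field           # stand-in for the object's NeRF


@dataclass
class SceneLayout:
    global_prompt: str     # scene-level text prompt
    nodes: List[BoxNode]


def constant_field(density: float, rgb) -> Field:
    """Toy field: the same density and color everywhere inside the box."""
    color = np.asarray(rgb, dtype=np.float64)

    def field(local_pts: np.ndarray):
        n = local_pts.shape[0]
        return np.full(n, density), np.tile(color, (n, 1))

    return field


def composite(layout: SceneLayout, pts, eps: float = 1e-8):
    """Blend all local fields at global sample points `pts` of shape (N, 3)."""
    pts = np.asarray(pts, dtype=np.float64)
    total_sigma = np.zeros(pts.shape[0])
    weighted_rgb = np.zeros((pts.shape[0], 3))
    for node in layout.nodes:
        # Global frame -> the box's normalised local frame.
        local = (pts - node.center) / node.half_size
        inside = np.all(np.abs(local) <= 1.0, axis=1)
        if not inside.any():
            continue
        sigma, rgb = node.field(local[inside])
        total_sigma[inside] += sigma
        weighted_rgb[inside] += sigma[:, None] * rgb
    # Density-weighted mean color; points outside every box stay black.
    rgb = weighted_rgb / np.maximum(total_sigma, eps)[:, None]
    return total_sigma, rgb


# Two overlapping boxes with illustrative prompts (hypothetical values).
layout = SceneLayout(
    global_prompt="a bottle of wine on a table",
    nodes=[
        BoxNode("a table", np.zeros(3), np.ones(3),
                constant_field(2.0, [1.0, 0.0, 0.0])),
        BoxNode("a bottle of wine", np.array([1.0, 0.0, 0.0]), np.ones(3),
                constant_field(6.0, [0.0, 0.0, 1.0])),
    ],
)
samples = np.array([[-0.5, 0.0, 0.0], [0.5, 0.0, 0.0], [5.0, 5.0, 5.0]])
sigma, rgb = composite(layout, samples)
```

Because each object keeps its own field and box, editing the layout in this sketch only means changing `center` or `half_size` (or swapping a node) and compositing again, which mirrors the recomposition idea in the captions at a toy scale.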