Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Xiaolong Li; Jiawei Mo; Ying Wang; Chethan Parameshwara; Xiaohan Fei; Ashwin Swaminathan; CJ Taylor; Zhuowen Tu; Paolo Favaro; Stefano Soatto

TL;DR

Grounded-Dreamer tackles compositional text-to-3D synthesis by coupling a guided four-view generation stage with a diffusion-prior enhanced NeRF refinement. The first stage uses attention refocusing to produce coherent, text-aligned four-view images from a pre-trained multi-view diffusion model, while the second stage performs coarse-to-fine NeRF reconstruction guided by sparse-view supervision and a warm-started SDS loss to achieve high fidelity. Empirical results show improvements in compositional accuracy and text-image alignment over state-of-the-art baselines, with the ability to generate diverse 3D assets from the same prompt and without re-training the diffusion model. The method achieves a favorable balance between quality and efficiency, addressing common failure modes such as Janus-like distortions and incomplete compositional priors. This approach advances scalable, grounded 3D content creation from natural language prompts with practical implications for content creation pipelines and interactive design.

Abstract

In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have been shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts and may entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the need to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D assets from the same text prompt.
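
The attention refocusing idea in the abstract can be illustrated with a small sketch. It assumes an Attend-and-Excite-style objective (the outline lists "Attend-and-Excite Revisited" among the preliminaries): for each subject token, take the peak of its cross-attention map, and penalize the most neglected token. The function name `refocus_loss`, the array layout `(views, height, width, tokens)`, and the averaging over the four views are assumptions made for illustration, not the authors' implementation. The Gaussian smoothing used in Attend-and-Excite is omitted for brevity.

```python
import numpy as np


def refocus_loss(attn, subject_token_ids):
    """Attend-and-Excite-style loss on multi-view cross-attention maps.

    attn: array of shape (views, height, width, tokens), values in [0, 1].
    subject_token_ids: indices of the prompt tokens that must appear.

    For each view, the loss is 1 minus the peak attention of the most
    neglected subject token. Averaging over views is an assumption.
    """
    attn = np.asarray(attn, dtype=float)
    per_view = []
    for v in range(attn.shape[0]):
        # Peak attention value of every subject token in this view.
        peaks = [attn[v, :, :, t].max() for t in subject_token_ids]
        # The worst-attended token determines the loss for this view.
        per_view.append(max(1.0 - p for p in peaks))
    return float(np.mean(per_view))


def make_toy_attention(seed=0):
    """Hypothetical toy maps: token 2 is well attended, token 5 is neglected."""
    rng = np.random.default_rng(seed)
    attn = rng.uniform(0.0, 0.2, size=(4, 16, 16, 8))
    attn[:, 3, 3, 2] = 0.9
    return attn
```

In a sampler, a loss of this kind would be differentiated with respect to the latent at selected denoising steps and used to update the latent, so the diffusion model itself stays frozen. That step needs an autodiff framework and is left out here.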

Paper Structure (33 sections, 3 equations, 10 figures, 5 tables, 1 algorithm)

Introduction
Related Works
Text-to-3D
Sparse Image-to-3D with Diffusion Models
3D Compositional Generation
Preliminaries
Attend-and-Excite Revisited
Multi-View Diffusion Models
Method
Attention Refocusing for Accurate Compositional 4-View Generation
Coarse-to-Fine Synergistic Reconstruction With Diffusion Priors
Early few-shot NeRF training
3D distillation with warm-start SDS loss
Experiments
Experimental Setup
...and 18 more sections

Figures (10)

Figure 1: Illustration of common failure patterns when naively combining sparse reference images with the SDS loss. Text prompt: "Two foxes fighting". When combining 4 reference images, (a) ends up generating a fox with two tails, while (b) misses the 'two' information.
Figure 2: Illustration of the two-stage pipeline of our Grounded-Dreamer. Given a text prompt, we first generate compositionally correct 4-view images using iterative latent optimization at selected DDIM sampling steps. The 4-view reference images, together with the masks, are combined with the score distillation sampling (SDS) loss in our hybrid training strategy, which creates high-fidelity 3D assets while preserving the compositional priors accurately.
Figure 3: Illustration of the second-stage training progress with Grounded-Dreamer. We show a fixed front-view rendering of the target NeRF at different optimization steps. Our method gradually creates high-fidelity 3D assets while preserving the compositional priors accurately.
Figure 4: 4-view generation; each pair uses the same random seed. Our inference-stage optimization encourages compositionally correct 4-view generation compared to the original MVDream.
Figure 5: Qualitative comparison for compositional text-to-3D. From top to bottom, the methods are: our Grounded-Dreamer, MVDream [shi2023mvdream], Magic123 [qian2023magic123], Wonder3D [long2023wonder3d], Magic3D [lin2023magic3d], and ProlificDreamer [wang2023prolificdreamer]. Our method generates more compositionally complete views with high quality. Text prompts: (a) "a zoomed out DSLR photo of an adorable kitten lying next to a flower", (b) "a zoomed out DSLR photo of a beagle eating a donut", (c) "a zoomed out DSLR photo of a chimpanzee holding a cup of hot coffee", (d) "a blue candle on a red cake in a yellow tray", (e) "a lego tank with a golden gun and a red flying flag", (f) "a model of a silver house with a golden roof beside an origami coconut tree".
...and 5 more figures
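
The hybrid training strategy named in the Figure 2 and Figure 3 captions, together with the outline entries "Early few-shot NeRF training" and "3D distillation with warm-start SDS loss", suggests a schedule in which the NeRF is first fitted to the four reference views and the SDS term is introduced afterwards. The sketch below shows one possible form of such a schedule. The function names, the linear ramp, the step counts, and the choice to keep the reference loss active throughout are all hypothetical and are not taken from the paper.

```python
def sds_weight(step, warmup_steps=500, ramp_steps=500, max_weight=1.0):
    """Weight of the SDS term at a given optimization step.

    Hypothetical schedule: zero during the few-shot warm-up, then a
    linear ramp up to max_weight. ramp_steps must be positive.
    """
    if step < warmup_steps:
        return 0.0
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return max_weight * progress


def hybrid_loss(rgb_loss, sds_loss, step, **schedule):
    """Reference-view reconstruction loss plus the scheduled SDS term.

    rgb_loss and sds_loss are scalar loss values computed elsewhere;
    keeping rgb_loss at full weight for all steps is an assumption.
    """
    return rgb_loss + sds_weight(step, **schedule) * sds_loss
```

With these illustrative defaults, only the reference-view loss is active for the first 500 steps, which corresponds to the coarse few-shot phase. The diffusion prior then takes over gradually to refine views that the four reference images do not cover.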
