BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

Yonghao Yu; Shunan Zhu; Huai Qin; Haorui Li

TL;DR

BoostDream presents a three-stage, plug-and-play refinement framework that accelerates high-quality text-to-3D generation by marrying fast feed-forward initialization with multi-view diffusion-based refinement. It introduces a novel 3D representation distillation, a multi-view render system, and a multi-view SDS loss (MV-SDS) with normal-map guidance and orientation/opacity terms, enabling robust refinement across NeRF, DMTet, and 3D Gaussian Splatting representations. The approach addresses the Janus problem, enhances detail through self-guided refinement, and significantly reduces training iterations compared to SDS-only methods, confirmed by extensive experiments, ablations, and user studies. This work offers a practical path to efficient, high-fidelity 3D asset generation suitable for VR, gaming, and related industries while generalizing across diverse differentiable 3D representations.

Abstract

Driven by the evolution of text-to-image diffusion models, text-to-3D generation has made significant strides. Two primary paradigms currently dominate the field: feed-forward generation, which produces 3D assets swiftly but often yields coarse results, and Score Distillation Sampling (SDS) based optimization, which generates high-fidelity 3D assets but at a slower pace. Integrating the two holds substantial promise for advancing 3D generation. In this paper, we present BoostDream, a highly efficient plug-and-play 3D refining method designed to transform coarse 3D assets into high-quality ones. The BoostDream framework comprises three distinct processes: (1) We introduce 3D model distillation, which fits differentiable representations to the 3D assets obtained through feed-forward generation. (2) We design a novel multi-view SDS loss, which uses a multi-view aware 2D diffusion model to refine the 3D assets. (3) We propose to use the prompt and multi-view consistent normal maps as guidance during refinement. We conduct extensive experiments on different differentiable 3D representations, which show that BoostDream generates high-quality 3D assets rapidly and overcomes the Janus problem that affects conventional SDS-based methods. This represents a substantial advance in both the efficiency and quality of 3D generation.
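To make the idea behind process (2) concrete, the following is a minimal sketch of an SDS-style gradient computed jointly over several rendered views. It is not the paper's MV-SDS loss, which per the TL;DR also involves normal-map guidance and orientation/opacity terms; it only illustrates the standard SDS residual w(t)(eps_hat - eps) applied to a batch of views that share one noise level. All names (`add_noise`, `mv_sds_grad`, `make_oracle_denoiser`, `refine`) are hypothetical, the weighting w(t) = 1 - alpha_bar is one common choice rather than the authors', the views themselves stand in for the 3D parameters (an identity renderer), and the "denoiser" is a toy oracle that stands in for a multi-view aware diffusion model.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def mv_sds_grad(views, denoiser, alpha_bar, rng):
    """SDS-style gradient w.r.t. rendered views of shape (V, H, W, C).

    All V views are noised at the same level and passed to the denoiser
    together, which is where a multi-view aware model could enforce
    cross-view consistency. The result is averaged over views.
    """
    eps = rng.standard_normal(views.shape)
    x_t = add_noise(views, eps, alpha_bar)
    eps_hat = denoiser(x_t, alpha_bar)  # joint noise prediction (assumed interface)
    w = 1.0 - alpha_bar                 # assumed weighting w(t)
    return w * (eps_hat - eps) / views.shape[0]

def make_oracle_denoiser(target):
    """Toy stand-in for a diffusion model: predicts the noise that would
    explain x_t if the clean views were exactly `target`."""
    def denoiser(x_t, alpha_bar):
        return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)
    return denoiser

def refine(views, denoiser, steps=50, lr=1.0, seed=0):
    """Gradient descent on the views, sampling a new noise level each step."""
    rng = np.random.default_rng(seed)
    views = views.copy()
    for _ in range(steps):
        alpha_bar = rng.uniform(0.02, 0.98)
        views -= lr * mv_sds_grad(views, denoiser, alpha_bar, rng)
    return views
```

In a real pipeline the gradient would be backpropagated through a differentiable renderer into the NeRF, DMTet, or 3D Gaussian Splatting parameters rather than applied to the images directly. Calling `refine(coarse, make_oracle_denoiser(target))` on four coarse views pulls them toward `target`, which mimics how the diffusion prior steers a coarse asset during refinement.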

Paper Structure (21 sections, 13 equations, 10 figures, 1 table)

Introduction
Related Work
Feed-Forward Generation Method
SDS-Based Optimization Generation Method
Multi-View in 3D Generation
BoostDream
Background
3D Representation Initialization
Multi-View Render System
Multi-View SDS
Experiments
Implementation Details
Refinement Experiment
Comparison Experiment
Ablation Study
...and 6 more sections

Figures (10)

Figure 1: Comparison of 3D generation results of the baseline and BoostDream. Provided with a coarse 3D asset and text prompt pair, BoostDream can refine it into a high-quality 3D asset efficiently. In each set of images, the image on the left is the coarse 3D asset generated by Shap-E [jun2023shape], and the three images on the right are our refined 3D asset.
Figure 2: Overview of the proposed BoostDream. BoostDream is a three-stage framework for refining a coarse 3D asset into a high-quality 3D asset. In the initialization stage, we use the feed-forward generation method to get a coarse 3D asset and fit it into differentiable 3D representations to make it trainable. The boost stage is guided by the multi-view normal maps of the coarse 3D asset to ensure stability from the beginning of the refining stage, and the self-boost stage is guided by its own multi-view normal maps to generate 3D assets with more detail and higher quality.
Figure 3: The first column shows the Shap-E [jun2023shape] results and the remaining columns show the refined results of our method. The results show that BoostDream can refine and edit input 3D assets according to different prompts.
Figure 4: Comparison with Shap-E [jun2023shape], DreamFusion [poole2022dreamfusion] and Magic3D [lin2023magic3d] on the same text-to-3D generation task. Our model has significantly stronger prompt relevancy and much better quality (best viewed by zooming in). See the results of our method on DMTet [shen2021dmtet] and 3D Gaussian Splatting [kerbl3Dgaussians] in the Appendix [Boost2024].
Figure 5: Ablation study. (a) Without the initialization stage. (b) Without the boost stage. (c) Without the self-boost stage. (d) Our complete BoostDream method.
...and 5 more figures
