VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Zixuan Chen; Ruijie Su; Jiahao Zhu; Lingxiao Yang; Jian-Huang Lai; Xiaohua Xie

TL;DR

Pose-dependent Consistency Distillation Sampling (PCDS) is a novel yet efficient objective for diffusion-based 3D generation tasks. It builds a pose-dependent consistency function within diffusion trajectories, allowing the true gradients to be approximated with minimal sampling steps (1-3).

Abstract

Text-to-3D generation aims to create 3D assets from text-to-image diffusion models. However, existing methods face an inherent bottleneck in generation quality because widely used objectives such as Score Distillation Sampling (SDS) inappropriately omit the U-Net Jacobians for swift generation, leading to significant bias compared to the "true" gradient obtained by full denoising sampling. This bias causes an inconsistent updating direction, resulting in implausible 3D generation (e.g., color deviation, the Janus problem, and semantically inconsistent details). In this work, we propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel yet efficient objective for diffusion-based 3D generation tasks. Specifically, PCDS builds a pose-dependent consistency function within diffusion trajectories, allowing the true gradients to be approximated with minimal sampling steps (1-3). Compared to SDS, PCDS acquires a more accurate updating direction with the same sampling time (1 sampling step), while enabling few-step (2-3) sampling to trade compute for higher generation quality. For efficient generation, we propose a coarse-to-fine optimization strategy, which first uses 1-step PCDS to create the basic structure of 3D objects and then gradually increases the number of PCDS steps to generate fine-grained details. Extensive experiments demonstrate that our approach outperforms the state of the art in generation quality and training efficiency, conspicuously alleviating the implausible 3D generation issues caused by the deviated updating direction. Moreover, it can be readily applied to many 3D generative applications to yield impressive 3D assets. Please see our project page: https://narcissusex.github.io/VividDreamer.
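The coarse-to-fine strategy described in the abstract can be illustrated with a small step schedule. This is only a sketch: the abstract states that optimization starts with 1-step PCDS and gradually increases to 2-3 steps, but it does not give the schedule itself. The function name `pcds_steps`, the equal-length phases, and the iteration counts below are all assumptions made for illustration.

```python
def pcds_steps(iteration: int, total_iterations: int, max_steps: int = 3) -> int:
    """Return how many PCDS sampling steps to use at a given training iteration.

    Hypothetical schedule (not taken from the paper): training is split into
    `max_steps` equal phases; phase 0 uses 1 step, phase 1 uses 2 steps, etc.
    """
    if total_iterations <= 0 or max_steps <= 0:
        raise ValueError("total_iterations and max_steps must be positive")
    if not 0 <= iteration < total_iterations:
        raise ValueError("iteration must lie in [0, total_iterations)")
    # Integer arithmetic maps the iteration index to a phase in [0, max_steps - 1].
    phase = iteration * max_steps // total_iterations
    return min(phase + 1, max_steps)


if __name__ == "__main__":
    # Assumed budget of 3000 iterations, chosen only for the demonstration.
    for it in (0, 999, 1000, 2000, 2999):
        print(it, pcds_steps(it, 3000))
```

Early iterations use the cheapest setting (one sampling step, the same cost the abstract attributes to SDS), and later iterations spend more compute per update on fine-grained detail.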

Paper Structure (27 sections, 16 equations, 10 figures, 1 table, 1 algorithm)

This paper contains 27 sections, 16 equations, 10 figures, 1 table, 1 algorithm.

Introduction
Related Works
Text-to-Image Diffusion Models
Differentiable 3D Representations
Text-to-3D Generation Models
Preliminaries
Diffusion Models
3D Gaussian Splatting
Methods
Review of SDS and ISM
Pose-dependent Consistency Distillation Sampling
Coarse-to-Fine Optimization
Advanced Generation Pipeline
Efficient Initialization
ControlNet-based Enhancement
...and 12 more sections

Figures (10)

Figure 1: Examples of text-to-3D asset creations with our framework (a). We present an efficient text-to-3D generation framework, VividDreamer, that distills semantically consistent textures and high-fidelity structures from pretrained 2D diffusion models using a novel Pose-dependent Consistency Distillation Sampling objective in a coarse-to-fine optimization manner, yielding high-fidelity 3D objects (rows 1 and 2) and 3D avatars (row 3) from the given text prompts. Specifically, VividDreamer achieves high training efficiency: it can create ready-to-use 3D assets within 10 minutes and photorealistic 3D objects within 30 minutes (b). More results can be found in \ref{fig:more} and on our project page: https://narcissusex.github.io/VividDreamer.
Figure 2: An overview of VividDreamer. We employ 3D Gaussian Splatting (3DGS) [3dgs] as the 3D representation and initialize it using the pre-trained Point-E [point-e] with the given text prompts. In training, given a camera pose $c$, we render the corresponding view $x_0=g(\theta,c)$ with the 3DGS rendering pipeline and perturb it into the noisy input $x_t$ of the 2D diffusion model using DDPM/DDIM inversion. Then, we employ the proposed Pose-dependent Consistency Distillation Sampling (PCDS) to map the noise $x_t$ to the pseudoGTs $\tilde{x}^t_0$ (i.e., the denoised images) through few-step (1$\sim$3) sampling. Finally, we calculate the Mean Square Error (MSE) loss $\mathcal{L}_{PCDS}$ between the rendered views $x_0$ and the pseudoGTs $\tilde{x}^t_0$, and update the parameters $\theta$ of the 3D Gaussians with the gradients $\nabla_\theta\mathcal{L}_{PCDS}$ in \ref{eq:cds_consistency}.
Figure 3: Examples of different objectives. Acquiring the "true" gradient (a) is time-consuming, as it requires full denoising sampling in each iteration. To skip this lengthy process, Score Distillation Sampling (SDS) [poole2022dreamfusion] (b) directly maps the noise to data (i.e., pseudoGTs) using 1-step DDPM sampling, but SDS struggles to acquire accurate gradients due to the intrinsic randomness introduced by DDPM. In contrast, our PCDS builds the pose-dependent consistency function $f_\phi$ from any timestep $t$ to the origin $0$ within diffusion trajectories, allowing it to generate accurate pseudoGTs and acquire precise gradients via minimal sampling steps (1$\sim$3).
Figure 4: Visual comparisons between our framework and 4 state-of-the-art methods for text-to-3D generation. Experimental results show that our approach is capable of creating high-fidelity 3D assets that maintain consistent semantics with the given text prompts, significantly alleviating the color deviation, Janus problem, and semantically inconsistent details caused by inaccurate gradient estimation. The training time is evaluated on a single A100 GPU.
Figure 5: Visual results generated by our VividDreamer framework with 30 minutes of training on a single A100 GPU. As shown, our approach creates high-fidelity 3D assets based on various text prompts. More visual results can be found on our project page: https://narcissusex.github.io/VividDreamer.
...and 5 more figures
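The last step of the pipeline in the Figure 2 caption, an MSE loss between the rendered view $x_0$ and the pseudoGT $\tilde{x}^t_0$, can be sketched in a few lines. This is a toy illustration under strong assumptions, not the authors' implementation: images are flat lists of floats, the renderer is the identity (so the "parameters" are the pixels themselves), and the pseudoGT is a fixed target rather than the output of a diffusion model. The names `mse_loss_and_grad` and `gradient_step` are hypothetical.

```python
from typing import List, Sequence, Tuple


def mse_loss_and_grad(
    x0: Sequence[float], pseudo_gt: Sequence[float]
) -> Tuple[float, List[float]]:
    """Return the MSE between x0 and pseudo_gt, and its gradient w.r.t. x0.

    The pseudoGT is treated as a constant target, so no gradient flows into it.
    """
    if len(x0) != len(pseudo_gt) or len(x0) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x0)
    diff = [a - b for a, b in zip(x0, pseudo_gt)]
    loss = sum(d * d for d in diff) / n
    grad = [2.0 * d / n for d in diff]
    return loss, grad


def gradient_step(x0: Sequence[float], grad: Sequence[float], lr: float) -> List[float]:
    """One plain gradient-descent update (assumed optimizer, for illustration only)."""
    return [a - lr * g for a, g in zip(x0, grad)]


if __name__ == "__main__":
    rendered = [1.0, 2.0]   # stand-in for a rendered view x_0
    target = [0.0, 0.0]     # stand-in for a pseudoGT from few-step sampling
    loss, grad = mse_loss_and_grad(rendered, target)
    updated = gradient_step(rendered, grad, lr=0.5)
    new_loss, _ = mse_loss_and_grad(updated, target)
    print(loss, new_loss)
```

In the actual framework the gradient would be propagated through the differentiable 3DGS renderer to the Gaussian parameters $\theta$; the identity renderer here only keeps the example self-contained.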
