DreamMapping: High-Fidelity Text-to-3D Generation via Variational Distribution Mapping

Zeyu Cai, Duotun Wang, Yixun Liang, Zhijing Shao, Ying-Cong Chen, Xiaohang Zhan, Zeyu Wang

TL;DR

This paper proposes Variational Distribution Mapping (VDM), a strategy that speeds up distribution modeling by treating rendered images as degraded instances of diffusion-based generation. This allows the variational distribution to be trained efficiently, without computing the Jacobians of the diffusion U-Net.

Abstract

Score Distillation Sampling (SDS) has emerged as a prevalent technique for text-to-3D generation, enabling 3D content creation by distilling view-dependent information from text-to-2D guidance. However, SDS-based methods frequently exhibit shortcomings such as over-saturated color and excessive smoothness. In this paper, we conduct a thorough analysis of SDS and refine its formulation, finding that its core design is to model the distribution of rendered images. Following this insight, we introduce a novel strategy called Variational Distribution Mapping (VDM), which expedites the distribution modeling process by regarding the rendered images as instances of degradation from diffusion-based generation. This design enables efficient training of the variational distribution by skipping the calculation of the Jacobians in the diffusion U-Net. We also introduce timestep-dependent Distribution Coefficient Annealing (DCA) to further improve distillation precision. Leveraging VDM and DCA, we use Gaussian Splatting as the 3D representation and build a text-to-3D generation framework. Extensive experiments and evaluations demonstrate that VDM and DCA generate high-fidelity, realistic assets while remaining efficient to optimize.

Paper Structure (22 sections, 20 equations, 12 figures, 4 tables, 1 algorithm)

Figures (12)

  • Figure 2: Advantage of VDM. Other methods that model the variational distribution of rendered images need extra time to compute the complex UNet Jacobian of the diffusion model. For instance, in methods that apply LoRA [ProlificDreamer:NIPS:2023, DreamFlow:Arxiv:2024, ASD:Arxiv:2023, LODS:Arxiv:2023] or a learnable embedding [LODS:Arxiv:2023], the backward pass must go through the Stable Diffusion UNet when optimizing the variational distribution, which adds computation time. VDM avoids this and therefore takes less time to optimize.
  • Figure 3: Framework overview. Text-to-3D generation starts with shape initialization (i.e., Shap-E UnpairedShape:Graph:2018) of the 3D representation $\theta$ based on the text input $y$. Using pre-trained Stable Diffusion, we perturb rendered images of random views, $\mathbf{x}=g(\theta,c)$, into noisy latents $\mathbf{x}_t$. After learning the image degradation $\psi$, we update $\theta$ with the VDM-based loss $\mathcal{L}_{VDM}$. Notably, the gradient flow bypasses the Jacobian of the frozen Stable Diffusion UNet, which significantly speeds up optimization.
  • Figure 4: Qualitative comparisons with recent popular text-to-3D generation methods based on 3DGS and NeRF. We present rendered images of two views for each method. The results show that our method generates 3D content closely aligned with the textual prompts, with high fidelity and intricate details. Please zoom in for details. Additional comparisons can be found in Figure \ref{fig:add_vis_comp}.
  • Figure 5: Ablation study on VDM and DCA. Compared to SDS, VDM adds significant appearance detail to 3D models, and DCA further controls color saturation.
  • Figure 6: Ablation study on designs of the image degradation process. We show the effects of modeling this degradation with a linear learnable operator, a nonlinear learnable operator with noise, and our choice, a noise-free nonlinear learnable operator ($M_\psi$).
  • ...and 7 more figures
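
The Figure 2 and Figure 3 captions describe two mechanics: perturbing a rendered image $\mathbf{x}$ into a noisy latent $\mathbf{x}_t$, and a gradient that bypasses the UNet Jacobian. The minimal sketch below shows both using the standard DDPM forward process and the SDS-style residual gradient. It is not the paper's VDM loss. The linear beta schedule, the helper names, and `toy_noise_predictor` (a stand-in for the frozen diffusion UNet) are assumptions made for this example only.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_s) for s = 1..t (assumed linear schedule)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x, eps, t):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ei for xi, ei in zip(x, eps)]


def toy_noise_predictor(x_t, t):
    """Hypothetical stand-in for the frozen UNet; not a real diffusion model."""
    return [0.5 * v for v in x_t]


def sds_style_grad(x, t, rng, weight=1.0):
    """SDS-style gradient with respect to the rendered image x.

    The residual (eps_pred - eps) is used directly as the gradient, so the
    predictor is only evaluated and never differentiated: no UNet Jacobian.
    A renderer would then propagate this gradient to the 3D parameters.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = add_noise(x, eps, t)
    eps_pred = toy_noise_predictor(x_t, t)
    return [weight * (p - e) for p, e in zip(eps_pred, eps)]


rng = random.Random(0)
x = [0.2, -0.1, 0.7]  # toy "rendered image" with three pixels
grad = sds_style_grad(x, t=500, rng=rng)
```

In a real pipeline `x` would be a rendered image or latent tensor and the predictor a pre-trained network. The point here is only that the update needs a forward evaluation of the predictor, which is the property the captions attribute to bypassing the UNet Jacobian.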