ExactDreamer: High-Fidelity Text-to-3D Content Creation via Exact Score Matching

Yumin Zhang, Xingyu Miao, Haoran Duan, Bo Wei, Tejal Shah, Yang Long, Rajiv Ranjan

TL;DR

This work tackles the fidelity gap in diffusion-based text-to-3D generation by addressing DDIM inversion bias. It introduces Exact Score Matching (ESM), which uses auxiliary variables and a LoRA-driven recovery path to achieve exact recovery in the DDIM reverse process, mitigating the accumulation of errors that cause over-smoothing and content loss. Empirical results on Gaussian Splatting-based 3D representations demonstrate improved detail and prompt alignment over strong baselines, with careful analysis of hyperparameters and initialization effects. While showing practical gains on high-fidelity 3D content, the approach acknowledges potential instability and sensitivity to settings that warrant further refinement.

Abstract

Text-to-3D content creation is a rapidly evolving research area. Given the scarcity of 3D data, current approaches often adapt pre-trained 2D diffusion models for 3D synthesis. Among these approaches, Score Distillation Sampling (SDS) has been widely adopted. However, the issue of over-smoothing poses a significant limitation on the high-fidelity generation of 3D models. To address this challenge, LucidDreamer replaces the Denoising Diffusion Probabilistic Model (DDPM) in SDS with the Denoising Diffusion Implicit Model (DDIM) to construct Interval Score Matching (ISM). However, ISM inevitably inherits inconsistencies from DDIM, causing reconstruction errors during the DDIM inversion process. This results in poor performance in the detailed generation of 3D objects and loss of content. To alleviate these problems, we propose a novel method named Exact Score Matching (ESM). Specifically, ESM leverages auxiliary variables to mathematically guarantee exact recovery in the DDIM reverse process. Furthermore, to effectively capture the dynamic changes of the original and auxiliary variables, the LoRA of a pre-trained diffusion model implements these exact paths. Extensive experiments demonstrate the effectiveness of ESM in text-to-3D generation, particularly highlighting its superiority in detailed generation.

Paper Structure (21 sections, 1 theorem, 11 equations, 6 figures, 1 algorithm)


Key Result

Theorem 1

In diffusion-based 3D model generation, ESM is more effective than ISM in reducing accumulated error (i.e., $\epsilon_{\rm{ESM}} < \epsilon_{\rm{ISM}}$), thereby enhancing the detail fidelity of 3D representations.
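
The contrast behind this claim can be illustrated with a toy numerical sketch. This page does not reproduce the ESM update equations, so the coupled update below is an assumption: it follows the general coupled-transformation idea used by exact-inversion schemes such as EDICT, with a hypothetical mixing weight `p` and a hypothetical stand-in noise predictor `eps_model`. It is not the authors' implementation. It only shows why a plain DDIM inversion leaves a reconstruction error after a round trip, while a path with a copied auxiliary variable can be inverted exactly up to floating-point precision.

```python
import numpy as np

def eps_model(x, t):
    # Hypothetical stand-in for a noise predictor (assumption, not a real diffusion model).
    return np.tanh(1.5 * x + 0.1 * t)

def coeffs(alpha_bar, t):
    # Deterministic DDIM step written as x_{t-1} = a * x_t + b * eps(x_t, t).
    a = np.sqrt(alpha_bar[t - 1] / alpha_bar[t])
    b = np.sqrt(1.0 - alpha_bar[t - 1]) - a * np.sqrt(1.0 - alpha_bar[t])
    return a, b

def naive_round_trip(x0, alpha_bar):
    T = len(alpha_bar) - 1
    x = x0.copy()
    for t in range(1, T + 1):
        a, b = coeffs(alpha_bar, t)
        # Plain DDIM inversion: eps is evaluated at x_{t-1} instead of the unknown x_t.
        x = (x - b * eps_model(x, t)) / a
    for t in range(T, 0, -1):
        a, b = coeffs(alpha_bar, t)
        x = a * x + b * eps_model(x, t)
    return x

def coupled_round_trip(x0, alpha_bar, p=0.9):
    T = len(alpha_bar) - 1
    x, y = x0.copy(), x0.copy()  # auxiliary variable y starts as a copy of x
    for t in range(1, T + 1):
        a, b = coeffs(alpha_bar, t)
        # Exact algebraic inverse of the coupled denoising step below.
        y_i = (y - (1.0 - p) * x) / p
        x_i = (x - (1.0 - p) * y_i) / p
        y = (y_i - b * eps_model(x_i, t)) / a
        x = (x_i - b * eps_model(y, t)) / a
    for t in range(T, 0, -1):
        a, b = coeffs(alpha_bar, t)
        # Coupled denoising step: each variable is updated using the other, then mixed.
        x_i = a * x + b * eps_model(y, t)
        y_i = a * y + b * eps_model(x_i, t)
        x = p * x_i + (1.0 - p) * y_i
        y = p * y_i + (1.0 - p) * x
    return x

alpha_bar = np.linspace(0.999, 0.1, 11)  # toy cumulative schedule with 10 steps
x0 = np.linspace(-1.0, 1.0, 5)           # toy "rendered image" of 5 values

err_naive = np.max(np.abs(naive_round_trip(x0, alpha_bar) - x0))
err_coupled = np.max(np.abs(coupled_round_trip(x0, alpha_bar) - x0))
print("naive DDIM round-trip error:", err_naive)
print("coupled round-trip error:   ", err_coupled)
```

In this toy setting the coupled error stays at rounding level because every forward operation has a closed-form inverse, whereas the naive inversion error grows with the step size. How closely `p` corresponds to the mixed ratio $\rho$ in the paper is not established by this page.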

Figures (6)

  • Figure 1: Text-to-3D samples generated by our framework. We propose a novel method named Exact Score Matching (ESM) that uses a pre-trained 2D diffusion model to guide high-fidelity content generation. The generated 3D results illustrate the superiority of ESM.
  • Figure 2: Overview of our framework. Under camera pose $C$, the Gaussian Splatting representation $\Theta$ is rendered to a 2D image $\mathbf{x}_{0} = \mathcal{R}(\Theta, C)$, which is then mapped to $\mathbf{x}_{s}$ via DDIM inversion. To construct the exact recovery path, we introduce an auxiliary variable $\mathbf{x}_{s}'$ copied from $\mathbf{x}_{s}$. The intermediate variables $\mathbf{x}_{t}$ and $\mathbf{x}_{t}'$ are estimated by the LoRA and then mixed to obtain $\mathbf{x}_{t}$. Finally, $\Theta$ is updated by optimizing $\mathcal{L}_{\rm{ESM}}$, computed from $\mathbf{x}_{s}$ and $\mathbf{x}_{t}$.
  • Figure 3: Comparison with baselines in text-to-3D generation. We compare our method with current state-of-the-art (SoTA) methods; our method produces finer detail.
  • Figure 4: Effect of mixed ratio. We vary $\rho$ over the set $\{0.1, 0.3, 0.5, 0.7, 0.9\}$. The generated results show that a higher mixed ratio is typically beneficial for high-fidelity generation.
  • Figure 5: Effect of step size. The generated results are guided by the prompt "A red motorcycle". The parameters $\delta_{S}$ and $\delta_{T}$ are tuned over the sets $\{50, 100, 150, 200\}$ and $\{25, 50, 150, 200\}$, respectively. The results illustrate that the step sizes have a significant influence on the clarity of the generated 3D objects. Higher $\delta_{T}$ values produce blurrier results.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1