Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation

Zongrui Li, Minghui Hu, Qian Zheng, Xudong Jiang

TL;DR

This work addresses the persistent lack of detail and fidelity in text-to-3D generation by linking consistency distillation, as used in consistency models (CM), with score distillation sampling (SDS) through probability-flow ODEs (PF-ODEs). It introduces Guided Consistency Sampling (GCS), which comprises a Compact Consistency loss ($\mathcal{L}_{\text{CC}}$), a Conditional Guidance loss ($\mathcal{L}_{\text{CG}}$), and a Pixel-domain constraint loss ($\mathcal{L}_{\text{CP}}$), combined into a single $\mathcal{L}_{\text{GCS}}$ objective. It also augments 3D Gaussian Splatting (3DGS) with Brightness-Equalized Generation (BEG) to address over-saturation. The approach yields more detail and higher fidelity than state-of-the-art methods, supported by qualitative and quantitative results and an ablation study, and it establishes theoretical connections between CM and SDS. The released code supports reproducibility and adoption in the text-to-3D generation community.

Abstract

Although recent advancements in text-to-3D generation have significantly improved generation quality, issues such as a limited level of detail and low fidelity persist and require further improvement. To understand the essence of those issues, we thoroughly analyze current score distillation methods by connecting theories of consistency distillation to score distillation. Based on the insights acquired through this analysis, we propose an optimization framework, Guided Consistency Sampling (GCS), integrated with 3D Gaussian Splatting (3DGS) to alleviate those issues. Additionally, we have observed persistent oversaturation in the rendered views of generated 3D assets. From experiments, we find that it is caused by unwanted accumulated brightness in 3DGS during optimization. To mitigate this issue, we introduce a Brightness-Equalized Generation (BEG) scheme in 3DGS rendering. Experimental results demonstrate that our approach generates 3D assets with more details and higher fidelity than state-of-the-art methods. The code is released at https://github.com/LMozart/ECCV2024-GCS-BEG.

Paper Structure (20 sections, 2 theorems, 29 equations, 13 figures, 2 tables)

Key Result

Lemma

Let $\Delta t = \max \left\{\left|\delta_{k}\right|\right\}$, $k \in [0, \dots, n_s]$, where $n_s$ is the index of $\delta$ at time step $s$, and $F_{\theta}(\cdot, \cdot)$ is the origin prediction function grounded on the empirical PF-ODE. Assume $F_{\theta}$ satisfies the Lipschitz condition, if there ... $\hat{\mathbf{x}}_{\{0,e\}}$ is the distribution of $\mathbf{x}_0$ diffused to time $e$, $p$ is ...

Figures (13)

  • Figure 1: Text-to-3D generation results of the proposed Guided Consistency Sampling (GCS) and Brightness-equalized Generation. Each 3D asset is distilled from a pre-trained 2D diffusion model and demonstrated with three different views. The results below the dotted line are generated by the fine-tuned diffusion models.
  • Figure 2: An overview of the proposed GCS. We first initialize the 3D representation via a pre-trained 3D generator. In each training epoch, we randomly render a batch of views $\mathbf{x}_\pi$ and diffuse them to $\mathbf{x}_{\{\pi, e\}}$ with a fixed noise $\epsilon^*$. We then apply the ODE diffusion process to gradually add noise to $\mathbf{x}_{\{\pi, e\}}$, transferring it to $\bar{\mathbf{x}}_{\{\pi, e\rightarrow s\}}$ and then to $\bar{\mathbf{x}}_{\{\pi, s\rightarrow t\}}$. In the denoising path, we conduct conditional and unconditional denoising steps, as shown in the figure. Finally, we calculate $\mathcal{L}_{\text{GCS}}$ (Eq. \ref{eq:gcs}) to update the parameters of the 3D representation (3DGS). Note that we add '$*$' to $\bar{\mathbf{x}}_{\{\pi, e \rightarrow 0; y\}}$ to indicate that it is obtained from a different sampling trajectory.
  • Figure 3: Generated views by using $\mathcal{L}_{\text{CC}}$ with different CFG strategies at a low CFG weight ($w=7.5$). While $\mathcal{L}_{\text{CC}}$ (left) implements CFG in only one denoising step, $\mathcal{L}^*_{\text{CC}}$ (right) applies CFG in every ODE denoising step.
  • Figure 4: Qualitative comparison of text-to-3D generation results between the proposed method and other methods. From left to right: results generated by DreamFusion [poole2022dreamfusion], GaussianDreamer [yi2023gaussiandreamer], ProlificDreamer [wang2024prolificdreamer], LucidDreamer [liang2023luciddreamer], and the proposed method, with CFG weights of $100$, $100$, $7$, $7$, and $7$, respectively. For each sub-figure, left: main view, right-top: back view, right-bottom: normal/depth map of the 3D asset.
  • Figure 5: Ablation study of the proposed components. LucidDreamer [liang2023luciddreamer] serves as the baseline. From left to right, we show the results under the settings: (a) LucidDreamer [liang2023luciddreamer] + BEG, (b) $\mathcal{L}_{\mathrm{CC}}(\xi)+\mathcal{L}_{\mathrm{CG}}(\xi)$, (c) $\mathcal{L}_{\mathrm{GCS}}(\xi)$, and (d) the full model ($\mathcal{L}_{\mathrm{GCS}}(\xi)$ + BEG).
  • ...and 8 more figures
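
The diffusion, ODE, and CFG steps named in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the linear beta schedule, the function names (`alpha_bar`, `diffuse`, `cfg`, `ddim_step`), and the use of a deterministic DDIM-style update as the PF-ODE step are all assumptions made for illustration, and the noise predictions are passed in as plain lists rather than produced by a diffusion model.

```python
import math

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) for i < t under a linear beta
    schedule (an assumed schedule, not taken from the paper)."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def diffuse(x0, eps, t):
    """Forward diffusion with a fixed noise eps:
    x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def ddim_step(x_t, eps_pred, t, t_next):
    """Deterministic DDIM-style move from time t to t_next.
    t_next > t adds noise along the ODE; t_next < t denoises."""
    a_t, a_n = alpha_bar(t), alpha_bar(t_next)
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the target time with the same prediction.
    return [math.sqrt(a_n) * x0 + math.sqrt(1.0 - a_n) * e
            for x0, e in zip(x0_pred, eps_pred)]

# Toy usage: e < s, with the true noise standing in for the model prediction.
x0 = [0.2, -0.5, 1.0]
eps_star = [0.3, -1.2, 0.7]
x_e = diffuse(x0, eps_star, 200)            # render -> x_{pi,e} with fixed noise
x_s = ddim_step(x_e, eps_star, 200, 500)    # ODE step e -> s (adds noise)
x_0 = ddim_step(x_s, eps_star, 500, 0)      # ODE step s -> 0 (denoises)
```

In this toy setting the noise prediction is exact, so the ODE steps stay on one trajectory and the final step recovers `x0`. With a real noise predictor the trajectories generally differ, and in the sketch `cfg` would be applied to the conditional and unconditional predictions before each denoising `ddim_step`.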

Theorems & Definitions (4)

  • Lemma [zheng2024trajectory, song2023consistency, kim2023consistency, wu2024consistent3d]
  • Proof
  • Lemma [zheng2024trajectory, song2023consistency, kim2023consistency, wu2024consistent3d]
  • Proof