Taming Mode Collapse in Score Distillation for Text-to-3D Generation

Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra

TL;DR

Existing score distillation-based text-to-3D generation frameworks degenerate to maximum likelihood seeking on each view independently and thus suffer from mode collapse, which manifests as the Janus artifact in practice.

Abstract

Despite the remarkable performance of score distillation in text-to-3D generation, such techniques notoriously suffer from view inconsistency, also known as the "Janus" artifact, where the generated objects fake each view with multiple front faces. Although empirically effective methods have approached this problem via score debiasing or prompt engineering, a more rigorous perspective to explain and tackle it remains elusive. In this paper, we reveal that existing score distillation-based text-to-3D generation frameworks degenerate to maximum likelihood seeking on each view independently and thus suffer from mode collapse, which manifests as the Janus artifact in practice. To tame mode collapse, we improve score distillation by re-establishing the entropy term in the corresponding variational objective, applied to the distribution of rendered images. Maximizing the entropy encourages diversity among different views in generated 3D assets, thereby mitigating the Janus problem. Based on this new objective, we derive a new update rule for 3D score distillation, dubbed Entropic Score Distillation (ESD). We show theoretically that ESD can be simplified and implemented by just adopting the classifier-free guidance trick on top of variational score distillation. Although the method is embarrassingly straightforward, our extensive experiments demonstrate that ESD can be an effective treatment for Janus artifacts in score distillation.
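
The abstract says only that ESD amounts to applying the classifier-free guidance trick on top of variational score distillation. The following is a minimal sketch of what such a mix could look like, assuming (this is not confirmed by the text above) that the guidance interpolates, with a weight `lam`, between a camera-conditioned and a camera-unconditioned noise prediction of the fine-tuned variational network. The function name `esd_direction`, its arguments, and the random arrays standing in for network outputs are all hypothetical.

```python
import numpy as np

def esd_direction(eps_pretrained, eps_var_cond, eps_var_uncond, lam):
    """Illustrative CFG-style update direction (assumed form, not the authors' code).

    eps_pretrained : noise prediction of the frozen text-to-image model
    eps_var_cond   : variational-network prediction given the camera pose
    eps_var_uncond : variational-network prediction with the camera pose dropped
    lam            : mixing weight; lam = 1 keeps only the conditioned prediction
    """
    # Classifier-free-guidance-style interpolation of the two variational scores.
    eps_var = lam * eps_var_cond + (1.0 - lam) * eps_var_uncond
    # Residual that would be back-propagated through the renderer.
    return eps_pretrained - eps_var

# Random placeholders instead of real network outputs (assumption for the demo).
rng = np.random.default_rng(0)
shape = (4, 8, 8)
eps_p, eps_c, eps_u = (rng.standard_normal(shape) for _ in range(3))

vsd_like = esd_direction(eps_p, eps_c, eps_u, lam=1.0)  # conditioned score only
esd_half = esd_direction(eps_p, eps_c, eps_u, lam=0.5)  # equal mix of both scores
```

Under this assumed form, `lam = 1.0` recovers a direction that uses only the camera-conditioned prediction, while smaller values give the unconditioned prediction more weight.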

Paper Structure

This paper contains 44 sections, 9 theorems, 37 equations, 10 figures, 3 tables.

Key Result

Theorem 1

For any $\lambda \in \mathbb{R}$ and $\boldsymbol{\theta} \in \mathbb{R}^{D}$, we have $J_{Ent}(\boldsymbol{\theta}, \lambda) = \lambda \, \mathbb{E}_{t}[\Omega(t) \, \mathcal{D}_{KL}(q_t^{\boldsymbol{\theta}}(\boldsymbol{x}_t | \boldsymbol{y}) \,\Vert\, \cdots$

Figures (10)

  • Figure 1: A Preview of Qualitative Results. We present the front and back views of objects synthesized by VSD (ProlificDreamer) on the right two columns, and four views of our generated results on the left. VSD suffers from the "Janus" problem, where both front and back views contain a frontal face of the target object, while our method effectively mitigates this artifact. Please refer to the Appendix for more results.
  • Figure 2: Illustration of the effect of entropy regularization. Learned image distributions often exhibit a higher probability mass for objects' frontal faces. Pure maximum likelihood seeking is prone to mode collapse. Adding entropy regularization can expand the support of the fitted distribution $q^{\boldsymbol{\theta}}_t(\boldsymbol{x} | \boldsymbol{y})$ with mode-covering behavior.
  • Figure 3: Gaussian Example. To illustrate the effects of entropy regularization, we leverage SDS, VSD and ESD to fit a 2D Gaussian distribution. The blue points are sampled from the ground-truth distribution while the orange points are from the fitted distribution.
  • Figure 4: Qualitative Results. Our proposed method outperforms all baselines in terms of better geometry and well-constructed texture details. Our results deliver photo-realistic and diverse rendered views, while baseline methods more or less suffer from the Janus problem. Best viewed in an electronic copy.
  • Figure 5: Qualitative Results. We combine our proposed ESD with timestep scheduling in DreamTime (Huang et al., 2023) and compare it against baseline methods. Prompt: A ceramic lion.
  • ...and 5 more figures
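
The mode-collapse versus mode-covering contrast described in the captions of Figures 2 and 3 can be illustrated with a small closed-form toy. This is a sketch under stated assumptions, not the paper's experiment: the target is a standard 2D Gaussian, the fitted distribution is an isotropic Gaussian with mean `mu` and standard deviation `s`, and the objective is the expected log-likelihood plus `lam` times the entropy. The function `fit_gaussian` and all hyperparameters are hypothetical.

```python
import numpy as np

def fit_gaussian(lam, dim=2, steps=500, lr=0.05):
    """Gradient ascent on E_q[log p] + lam * H[q] for q = N(mu, s^2 I), p = N(0, I).

    Closed forms (up to constants that do not depend on mu or s):
      E_q[log p] = -0.5 * (||mu||^2 + dim * s^2)
      H[q]       = dim * log(s)
    """
    mu = np.full(dim, 2.0)  # start away from the target mean
    s = 0.5                 # initial standard deviation of the fitted Gaussian
    for _ in range(steps):
        grad_mu = -mu                          # likelihood pulls the mean to 0
        grad_s = -dim * s + lam * dim / s      # likelihood shrinks s, entropy grows it
        mu = mu + lr * grad_mu
        s = s + lr * grad_s
    return mu, s

mu_ml, s_ml = fit_gaussian(lam=0.0)    # pure likelihood seeking
mu_ent, s_ent = fit_gaussian(lam=1.0)  # entropy term restored (full KL objective)
print("lam=0: s =", s_ml)
print("lam=1: s =", s_ent)
```

Without the entropy term the fitted spread shrinks toward zero, so all mass concentrates on the single most likely point. With `lam = 1` the objective equals the negative KL divergence up to a constant, and the spread settles at the target's standard deviation.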

Theorems & Definitions (19)

  • Theorem 1
  • Lemma 1: Gradient of $J_{KL}$
  • proof
  • Lemma 2: SDS minimizes $J_{KL}$ (Poole et al., 2022)
  • proof
  • Lemma 3: Single-particle VSD minimizes $J_{KL}$ (Wang et al., 2023)
  • proof
  • Remark 1
  • Lemma 4: $J_{KL}$ is equivalent to maximum likelihood estimation
  • proof
  • ...and 9 more
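
The link between a KL objective and likelihood seeking named in Lemma 4 can be seen from the standard decomposition of the KL divergence. The lines below are a generic identity for two densities $q$ and $p$, offered as a sketch. The paper's own statement is not reproduced here and may involve noised distributions and time-dependent weights.

```latex
\begin{aligned}
\mathcal{D}_{KL}(q \,\Vert\, p)
  &= \mathbb{E}_{\boldsymbol{x} \sim q}\left[\log q(\boldsymbol{x}) - \log p(\boldsymbol{x})\right] \\
  &= -H[q] - \mathbb{E}_{\boldsymbol{x} \sim q}\left[\log p(\boldsymbol{x})\right].
\end{aligned}
```

If the entropy $H[q]$ does not depend on the parameters being optimized, minimizing the divergence is the same as maximizing the expected log-likelihood $\mathbb{E}_{\boldsymbol{x} \sim q}[\log p(\boldsymbol{x})]$, which rewards concentrating on high-density modes. Keeping an entropy term in the objective counteracts that concentration.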