VividDreamer: Invariant Score Distillation For Hyper-Realistic Text-to-3D Generation

Wenjie Zhuo, Fan Ma, Hehe Fan, Yi Yang

TL;DR

This work addresses over-saturation and over-smoothing in Score Distillation Sampling (SDS) for text-to-3D generation by decoupling SDS into a reconstruction term and a classifier-free guidance term. It introduces Invariant Score Distillation (ISD), which replaces the reconstruction term with an invariant score term derived from DDIM sampling, $\delta_{inv} = \epsilon_\phi(z_{t-c}; y, t-c) - \epsilon_\phi(z_t; y, t)$, enabling the use of a conventional guidance scale and reducing reconstruction-induced errors. The ISD framework combines $\delta_{inv}$ with the classifier-free guidance term $\delta_{cls} = \epsilon_\phi(z_t; y, t) - \epsilon_\phi(z_t; \varnothing, t)$ using a time-varying weight $\lambda(t)$ and a fixed guidance weight, preserving detail while avoiding over-saturation. Extensive experiments on text-to-3DGS and text-to-NeRF generation show that single-stage optimization with ISD yields realistic, highly detailed 3D objects and outperforms several baselines in both CLIP-based quantitative metrics and qualitative assessments, while maintaining efficiency and stability.
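The combination described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the linear form of `linear_lambda`, the default guidance weight of 7.5, and the choice to apply $\lambda(t)$ to $\delta_{inv}$ and the fixed weight to $\delta_{cls}$ are all assumptions made for the example. Noise predictions are represented as flat lists of floats.

```python
from typing import Callable, List, Sequence


def linear_lambda(t: int, t_max: int = 1000) -> float:
    """Hypothetical time-varying weight lambda(t); the paper's schedule may differ."""
    return t / t_max


def isd_direction(
    eps_cond_t: Sequence[float],     # eps_phi(z_t; y, t)
    eps_uncond_t: Sequence[float],   # eps_phi(z_t; null prompt, t)
    eps_cond_prev: Sequence[float],  # eps_phi(z_{t-c}; y, t-c)
    t: int,
    guidance_weight: float = 7.5,    # assumed value, not taken from the paper
    lam: Callable[[int], float] = linear_lambda,
) -> List[float]:
    """Combine the invariant score term and the guidance term element-wise."""
    # delta_inv = eps_phi(z_{t-c}; y, t-c) - eps_phi(z_t; y, t)
    delta_inv = [p - c for p, c in zip(eps_cond_prev, eps_cond_t)]
    # delta_cls = eps_phi(z_t; y, t) - eps_phi(z_t; null prompt, t)
    delta_cls = [c - u for c, u in zip(eps_cond_t, eps_uncond_t)]
    weight = lam(t)
    # Assumed weighting: lambda(t) on delta_inv, fixed weight on delta_cls.
    return [weight * di + guidance_weight * dc
            for di, dc in zip(delta_inv, delta_cls)]
```

In a real pipeline the three inputs would be tensors produced by the diffusion model, and the returned direction would be backpropagated through the renderer to the 3D parameters.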

Abstract

This paper presents Invariant Score Distillation (ISD), a novel method for high-fidelity text-to-3D generation. ISD aims to tackle the over-saturation and over-smoothing problems in Score Distillation Sampling (SDS). In this paper, SDS is decoupled into a weighted sum of two components: the reconstruction term and the classifier-free guidance term. We experimentally found that over-saturation stems from the large classifier-free guidance scale and over-smoothing comes from the reconstruction term. To overcome these problems, ISD utilizes an invariant score term derived from DDIM sampling to replace the reconstruction term in SDS. This operation allows the utilization of a medium classifier-free guidance scale and mitigates the reconstruction-related errors, thus preventing the over-smoothing and over-saturation of results. Extensive experiments demonstrate that our method greatly enhances SDS and produces realistic 3D objects through single-stage optimization.
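The decomposition of SDS into two components can be written out explicitly. The following is a sketch under the common SDS formulation with classifier-free guidance, where $\epsilon$ is the Gaussian noise injected into the rendered image and $w$ is the guidance scale. The name $\delta_{rec}$ for the reconstruction term is an assumption chosen to match $\delta_{cls}$, and the paper's exact weighting may differ.

$$
\hat{\epsilon}_\phi(z_t; y, t) = \epsilon_\phi(z_t; y, t) + w \left( \epsilon_\phi(z_t; y, t) - \epsilon_\phi(z_t; \varnothing, t) \right)
$$

$$
\hat{\epsilon}_\phi(z_t; y, t) - \epsilon = \underbrace{\epsilon_\phi(z_t; y, t) - \epsilon}_{\delta_{rec}} + w \, \underbrace{\left( \epsilon_\phi(z_t; y, t) - \epsilon_\phi(z_t; \varnothing, t) \right)}_{\delta_{cls}}
$$

Read this way, a large $w$ makes $\delta_{cls}$ dominate the update, which the abstract links to over-saturation, while $\delta_{rec}$ is the part it links to over-smoothing.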

Paper Structure (27 sections, 17 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: Examples generated by our framework. Our method can generate detailed, high-fidelity 3D objects from a wide range of textual prompts.
  • Figure 2: In 2D experiments with SDS, we use each term separately as the loss function to generate images for the prompt "An astronaut riding a horse". In the first row, the image is initialized from random noise; in the second row, it is initialized from a high-quality image sampled from Stable Diffusion 2-1 base (Rombach et al., 2022). The results show that, regardless of the initialization method, each term on its own fails to generate realistic and detailed samples.
  • Figure 3: To show the relationship between over-saturation and the guidance scale $w$ in SDS, we use different guidance scales for text-to-image generation. As $w$ increases, the generated images become increasingly over-saturated; as $w$ decreases, they become increasingly over-smoothed.
  • Figure 4: Overview of ISD for text-to-3D generation. We aim to optimize a 3D model $\theta$ using a pretrained text-to-image diffusion model. To this end, we render a 2D image $x_\pi = g(\theta, \pi)$ at a random pose $\pi$ and then employ a diffusion model $\epsilon_\phi$ to perform 2D score distillation. Specifically, given a rendered image $x_\pi$, we first add noise to it to obtain $z_t$ and then use the diffusion model to estimate the noise. Our framework uses three noise predictions: $\epsilon_\phi(z_t, y, t)$, $\epsilon_\phi(z_t, \varnothing, t)$ and $\epsilon_\phi(z_{t-c}, y, t-c)$. From these we compute the classifier-free guidance term $\delta_{cls}$ and the invariant score term $\delta_{inv}$, which are then used in our proposed ISD for optimization.
  • Figure 5: We use versions of a clean image with different amounts of noise as initial images, add noise to each at $t = 600$, and denoise with different residual terms. The results show that, as more noise is added to the initial image, our method consistently restores more detail than the baseline methods.
  • ...and 11 more figures
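
The three noise predictions described in the Figure 4 caption can be traced with a toy, standard-library-only sketch. Everything here is an assumption made for illustration: `toy_eps` is a stand-in for $\epsilon_\phi$ that pretends the clean image is a constant equal to the "prompt" value, `alpha_bar` is a made-up cosine-style schedule, and $z_{t-c}$ is obtained by a single deterministic DDIM step from $z_t$ using the conditional prediction, which may not match the paper's exact procedure.

```python
import math
import random
from typing import List, Optional, Tuple


def alpha_bar(t: int, t_max: int = 1000) -> float:
    """Toy cumulative noise schedule (assumption): 1 at t=0, near 0 at t=t_max."""
    return math.cos(0.5 * math.pi * t / t_max) ** 2


def add_noise(x: List[float], t: int, rng: random.Random) -> List[float]:
    """Forward diffusion: z_t = sqrt(a_t) * x + sqrt(1 - a_t) * noise."""
    a = alpha_bar(t)
    return [math.sqrt(a) * xi + math.sqrt(1 - a) * rng.gauss(0, 1) for xi in x]


def toy_eps(z: List[float], y: Optional[float], t: int) -> List[float]:
    """Hypothetical stand-in for eps_phi(z, y, t).

    It assumes the clean image is the constant y (0.0 when y is None, i.e.
    the empty prompt) and returns the noise that would explain z.
    """
    a = alpha_bar(t)
    target = 0.0 if y is None else y
    return [(zi - math.sqrt(a) * target) / math.sqrt(1 - a) for zi in z]


def ddim_step(z_t: List[float], eps: List[float], t: int, t_prev: int) -> List[float]:
    """One deterministic DDIM step from timestep t to t_prev."""
    a_t, a_prev = alpha_bar(t), alpha_bar(t_prev)
    x0 = [(zi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t) for zi, ei in zip(z_t, eps)]
    return [math.sqrt(a_prev) * xi + math.sqrt(1 - a_prev) * ei
            for xi, ei in zip(x0, eps)]


def isd_terms(x: List[float], y: float, t: int, c: int,
              rng: random.Random) -> Tuple[List[float], List[float]]:
    """Return (delta_cls, delta_inv) for a rendered image x; requires 0 < c < t."""
    z_t = add_noise(x, t, rng)
    eps_cond = toy_eps(z_t, y, t)          # eps_phi(z_t, y, t)
    eps_uncond = toy_eps(z_t, None, t)     # eps_phi(z_t, null prompt, t)
    z_prev = ddim_step(z_t, eps_cond, t, t - c)
    eps_prev = toy_eps(z_prev, y, t - c)   # eps_phi(z_{t-c}, y, t-c)
    delta_cls = [a - b for a, b in zip(eps_cond, eps_uncond)]
    delta_inv = [a - b for a, b in zip(eps_prev, eps_cond)]
    return delta_cls, delta_inv
```

Because this toy predictor is perfectly self-consistent along the DDIM trajectory, its `delta_inv` is zero up to floating-point error. A real diffusion model is not, so its $\delta_{inv}$ is generally non-zero.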