Score Distillation Sampling with Learned Manifold Corrective

Thiemo Alldieck, Nikos Kolotouros, Cristian Sminchisescu

TL;DR

Score Distillation Sampling (SDS) leverages a pretrained diffusion prior to steer optimization with text prompts. The authors decompose the SDS loss into a prompt-consistency term and a projection term, reveal a time-dependent frequency bias that induces noisy gradients, and propose Score Distillation Sampling with Learned Manifold Corrective (LMC-SDS) to factor out this bias using a learned corrective $\hat{b}_{\boldsymbol{\psi}}$ that aligns gradients with the natural image manifold. By training $\hat{b}_{\boldsymbol{\psi}}$ and dropping Jacobian terms, LMC-SDS produces cleaner gradients and reduces the need for high guidance, improving fidelity across optimization, editing, image-to-image translation, and text-to-3D tasks. Experiments across synthesis, editing, and DreamFusion demonstrate visible gains in detail, color realism, and diversity, with competitive or superior CLIP-based metrics and preserved image statistics.
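
The split into a prompt-consistency term and a projection term can be illustrated with plain arithmetic. The sketch below is not the authors' code. It assumes the standard classifier-free guidance form of the noise estimate, $\hat{\boldsymbol{\epsilon}}^{\omega} = \hat{\boldsymbol{\epsilon}}_{\text{uncond}} + \omega\,(\hat{\boldsymbol{\epsilon}}_{\text{cond}} - \hat{\boldsymbol{\epsilon}}_{\text{uncond}})$, omits the timestep weighting, and uses random vectors in place of real network outputs. All function names are hypothetical, and the paper's own parameterization may differ.

```python
import random


def cfg_noise(eps_cond, eps_uncond, omega):
    """Classifier-free guidance (assumed form): start at the unconditional
    estimate and move toward the conditional one by a factor omega."""
    return [u + omega * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def sds_residual(eps_cond, eps_uncond, eps, omega):
    """SDS-style residual: guided noise estimate minus the injected noise.
    The timestep weighting w(t) is left out for brevity."""
    guided = cfg_noise(eps_cond, eps_uncond, omega)
    return [g - e for g, e in zip(guided, eps)]


def split_residual(eps_cond, eps_uncond, eps, omega):
    """Split the residual into a prompt-dependent part and a prompt-free part.
    The two parts add up to the full SDS-style residual."""
    cond_term = [omega * (c - u) for c, u in zip(eps_cond, eps_uncond)]
    proj_term = [u - e for u, e in zip(eps_uncond, eps)]
    return cond_term, proj_term


# Stand-ins for network outputs (illustrative random data, not real predictions).
rng = random.Random(0)
n = 8
eps = [rng.gauss(0, 1) for _ in range(n)]
eps_uncond = [rng.gauss(0, 1) for _ in range(n)]
eps_cond = [rng.gauss(0, 1) for _ in range(n)]
omega = 7.5

full = sds_residual(eps_cond, eps_uncond, eps, omega)
cond_term, proj_term = split_residual(eps_cond, eps_uncond, eps, omega)

# Largest elementwise gap between the full residual and the sum of its parts.
max_err = max(abs(f - (c + p)) for f, c, p in zip(full, cond_term, proj_term))
print(max_err < 1e-9)
```

Only the first part scales with $\omega$, so raising the guidance weight changes the balance between the two parts rather than rescaling the whole residual.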

Abstract

Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem with its formulation, and propose a surprisingly easy but effective fix. Specifically, we decompose the loss into different factors and isolate the component responsible for noisy gradients. In the original formulation, high text guidance is used to account for the noise, leading to unwanted side effects such as oversaturation or repeated detail. Instead, we train a shallow network mimicking the timestep-dependent frequency bias of the image diffusion model in order to effectively factor it out. We demonstrate the versatility and the effectiveness of our novel loss formulation through qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis.

Paper Structure (25 sections, 12 equations, 15 figures, 1 table)

This paper contains 25 sections, 12 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Left: Visualization of SDS and LMC-SDS gradients w.r.t. pixel values and estimated denoised images for the given image $\mathbf{z}$, the prompt $\boldsymbol{y}=\text{"autumn"}$, and $t=0.5$. We visualize the negative gradient, i.e., the direction of change. Right: Power spectra of denoised images $\hat{\mathbf{z}}_t$ for varying time-step $t$ compared to the power spectrum of natural images $\mathbf{z}$. See the analysis section for details.
  • Figure 2: Optimization-based image synthesis results. We optimize an empty image to match a given prompt using our LMC-SDS, the original SDS loss, and $\mathcal{L}_{\text{cond}}$. SDS struggles to create detailed content when using low guidance $\omega$. High $\omega$ produces detailed results but colors may be oversaturated (chimpanzee face), fake detail may appear (2nd mouse tail), or artifacts emerge. $\mathcal{L}_{\text{cond}}$ is unstable to optimize and produces unrealistic colors. In contrast, our method produces detailed results with balanced colors.
  • Figure 3: Examples of optimization-based image editing results. We show pairs of input images (left) and editing result (right).
  • Figure 4: By fixing $\boldsymbol{\epsilon}$ in $\mathcal{L}_{\text{cond}}$ we can obtain diverse editing results. We show four variants of optimization-based editing results of the input image under the given prompt "a mountain during sunset".
  • Figure 5: Quantitative results for optimization-based editing under varying $\omega$. Left: Our method results in the highest CLIP score over all baselines for all $\omega$. Right: We plot LPIPS over CLIP for further performance insights: DDS stays close to the original image (lowest LPIPS) by performing only small edits (low CLIP). SDS and MS-SDS respect the prompt better (higher CLIP), but corrupt the image (high LPIPS). NFSD corrupts the image less (lower LPIPS), but exhibits weak editing capabilities (low CLIP). Our method shows the strongest editing capabilities (highest CLIP), while staying close to the original structure.
  • ...and 10 more figures