Score Distillation Sampling with Learned Manifold Corrective
Thiemo Alldieck, Nikos Kolotouros, Cristian Sminchisescu
TL;DR
Score Distillation Sampling (SDS) leverages a pretrained diffusion prior to steer optimization with text prompts. The authors decompose the SDS loss into a prompt-consistency term and a projection term, reveal a timestep-dependent frequency bias that induces noisy gradients, and propose Score Distillation Sampling with Learned Manifold Corrective (LMC-SDS) to factor out this bias using a learned corrective $\hat{b}_{\boldsymbol{\psi}}$ that aligns gradients with the natural image manifold. By training $\hat{b}_{\boldsymbol{\psi}}$ and dropping Jacobian terms, LMC-SDS produces cleaner gradients and reduces the need for high guidance, improving fidelity across optimization, editing, image-to-image translation, and text-to-3D tasks. Experiments on image synthesis, editing, and text-to-3D generation with DreamFusion show visible gains in detail, color realism, and diversity, with competitive or superior CLIP-based metrics and preserved image statistics.
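The decomposition mentioned above can be illustrated with a minimal numerical sketch. It assumes the common classifier-free guidance convention, in which the guided noise prediction is the unconditional prediction plus a guidance weight times the difference between the conditional and unconditional predictions. Under that assumption, the SDS residual splits exactly into a guidance-scaled term and a guidance-independent term. The function names, the random stand-in arrays, and the mapping of the two terms onto "prompt-consistency" and "projection" are illustrative assumptions, not the authors' implementation; no diffusion model is called here.

```python
import numpy as np


def guided_noise(eps_uncond, eps_cond, omega):
    """Classifier-free guidance (assumed convention): start from the
    unconditional prediction and move toward the conditional one."""
    return eps_uncond + omega * (eps_cond - eps_uncond)


def sds_residual(eps_uncond, eps_cond, eps, omega):
    """Per-pixel SDS residual: guided noise prediction minus injected noise.
    The timestep weighting w(t) and the Jacobian are omitted."""
    return guided_noise(eps_uncond, eps_cond, omega) - eps


def decomposed_residual(eps_uncond, eps_cond, eps, omega):
    """Same residual, written as two separate terms."""
    cond_term = eps_cond - eps_uncond  # assumed prompt-consistency direction
    proj_term = eps_uncond - eps       # assumed projection direction
    return omega * cond_term + proj_term, cond_term, proj_term


# Random arrays stand in for the outputs of a diffusion model.
rng = np.random.default_rng(0)
shape = (4, 4)
eps = rng.standard_normal(shape)         # noise injected into the image
eps_uncond = rng.standard_normal(shape)  # stand-in unconditional prediction
eps_cond = rng.standard_normal(shape)    # stand-in text-conditional prediction
omega = 7.5                              # hypothetical guidance weight

full = sds_residual(eps_uncond, eps_cond, eps, omega)
split, cond_term, proj_term = decomposed_residual(eps_uncond, eps_cond, eps, omega)
no_guidance = sds_residual(eps_uncond, eps_cond, eps, 0.0)
```

In this sketch only the first term is scaled by `omega`, which is consistent with the summary's point that high guidance is a way to outweigh the noisy component rather than remove it. With `omega = 0` the residual reduces to the second term alone.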
Abstract
Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem with its formulation, and propose a surprisingly easy but effective fix. Specifically, we decompose the loss into different factors and isolate the component responsible for noisy gradients. In the original formulation, high text guidance is used to account for the noise, leading to unwanted side effects such as oversaturation or repeated detail. Instead, we train a shallow network mimicking the timestep-dependent frequency bias of the image diffusion model in order to effectively factor it out. We demonstrate the versatility and the effectiveness of our novel loss formulation through qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis.
