Rethinking Score Distillation as a Bridge Between Image Distributions

David McAllister, Songwei Ge, Jia-Bin Huang, David W. Jacobs, Alexei A. Efros, Aleksander Holynski, Angjoo Kanazawa

TL;DR

This work reframes Score Distillation Sampling (SDS) as approximating a Schrödinger Bridge between the distribution of the image currently being optimized (the source) and the natural-image distribution (the target). The reframing exposes two sources of error: a first-order, linear approximation of the transport path, and a mismatch between the actual source distribution and the unconditional diffusion prior that SDS uses in its place. Analyzing SDS variants through this dual-bridge lens, the authors show that artifacts such as oversaturation arise when the source mismatch is large, and that describing the source distribution with a text prompt markedly improves transport quality at little extra computational cost. Appending descriptive prompts that specify the current source distribution is a simple, effective alternative to heavier approaches such as LoRA, and it achieves competitive results on text-to-image, text-guided NeRF, and painting-to-real tasks. The approach yields high-quality results with fewer artifacts and lower runtime, suggesting a practical route to diffusion-prior optimization in data-poor domains; combining multi-step transport with tailored schedules is highlighted as a direction for further gains.

Abstract

Score distillation sampling (SDS) has proven to be an important tool, enabling the use of large-scale diffusion priors for tasks operating in data-poor domains. Unfortunately, SDS has a number of characteristic artifacts that limit its usefulness in general-purpose applications. In this paper, we make progress toward understanding the behavior of SDS and its variants by viewing them as solving an optimal-cost transport path from a source distribution to a target distribution. Under this new interpretation, these methods seek to transport corrupted images (source) to the natural image distribution (target). We argue that current methods' characteristic artifacts are caused by (1) linear approximation of the optimal path and (2) poor estimates of the source distribution. We show that calibrating the text conditioning of the source distribution can produce high-quality generation and translation results with little extra overhead. Our method can be easily applied across many domains, matching or beating the performance of specialized methods. We demonstrate its utility in text-to-2D, text-based NeRF optimization, translating paintings to real images, optical illusion generation, and 3D sketch-to-real. We compare our method to existing approaches for score distillation sampling and show that it can produce high-frequency details with realistic colors.
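
The contrast the abstract draws between SDS and a source-calibrated update can be sketched on a toy problem. This is only an illustration under strong assumptions, not the authors' implementation: the text-conditioned diffusion model is replaced by the exact noise predictor of a Gaussian, and a distribution mean stands in for a prompt. The names `toy_eps`, `noised`, `sds_direction`, and `bridge_direction`, and all the numbers used, are hypothetical.

```python
import numpy as np


def toy_eps(x_t, mean, alpha, sigma, data_std=1.0):
    """Exact E[eps | x_t] when clean data is N(mean, data_std^2 I).

    Hypothetical stand-in for a text-conditioned noise predictor:
    `mean` plays the role of the prompt.
    """
    return sigma * (x_t - alpha * mean) / (alpha**2 * data_std**2 + sigma**2)


def noised(x, alpha, sigma, rng):
    """Forward-diffuse x: x_t = alpha * x + sigma * eps."""
    eps = rng.standard_normal(x.shape)
    return alpha * x + sigma * eps, eps


def sds_direction(x, tgt_mean, alpha, sigma, rng):
    """SDS-style direction: target-conditioned prediction minus the sampled noise."""
    x_t, eps = noised(x, alpha, sigma, rng)
    return toy_eps(x_t, tgt_mean, alpha, sigma) - eps


def bridge_direction(x, tgt_mean, src_mean, alpha, sigma, rng):
    """Difference of target- and source-conditioned predictions at the same x_t."""
    x_t, _ = noised(x, alpha, sigma, rng)
    return (toy_eps(x_t, tgt_mean, alpha, sigma)
            - toy_eps(x_t, src_mean, alpha, sigma))


# Assumed toy setup: the current image sits at the source mean.
src_mean = np.array([0.0, 0.0])
tgt_mean = np.array([2.0, -1.0])
alpha, sigma = 0.8, 0.6  # chosen so that alpha^2 + sigma^2 = 1
x = src_mean.copy()

d_bridge = bridge_direction(x, tgt_mean, src_mean, alpha, sigma,
                            np.random.default_rng(0))
d_sds = sds_direction(x, tgt_mean, alpha, sigma, np.random.default_rng(0))

# One gradient-descent step along the bridge direction.
x_next = x - 0.1 * d_bridge
print(d_bridge, d_sds, x_next)
```

In this Gaussian toy the sampled noise cancels in `bridge_direction`, so the update depends only on the gap between the source and target means, whereas `sds_direction` changes with every noise draw. That cancellation is a property of the toy setup and should not be read as a claim about real diffusion models.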

Paper Structure

This paper contains 30 sections, 13 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Optimization with diffusion models as an approximation of a Schrödinger Bridge Problem (SBP). We propose to formulate optimization with diffusion models as bridging the distribution of the current optimized image $x_\theta$ to the target distribution under a dual-bridge framework (a). Current methods can be interpreted as approximating the optimal transport $\epsilon^*_\text{SBP}$ between these distributions via the difference between projections of a noised image $x_{\theta,t}$ onto the two distributions. This analysis reveals two sources of error: (1) these gradients are linear approximations of the optimal path, as illustrated in (a), and (2) the source distribution used for computing this approximation (e.g., the unconditional distribution in SDS (Poole et al., 2022)) may not be aligned with the current distribution, as illustrated in (b).
  • Figure 2: Comparison of SDS variants under our analysis. We illustrate the major gradient components of different SDS variants and provide a straightforward comparison with $\mathbf{\epsilon}_\text{SBP}$.
  • Figure 2: Quantitative comparisons of NeRF optimization. We measure the average CLIP similarity of rendered views using SDS, VSD, and our method.
  • Figure 3: Text-to-image generation results with COCO Captions. We compare different score distillation methods for generating images with COCO captions by optimizing a randomly initialized image. DDIM sampling indicates the lower bound that the diffusion model can achieve. VSD (Wang et al., 2023) and our method produce the fewest color artifacts, while ours is more efficient than VSD.
  • Figure 4: Text-guided NeRF optimization with different score distillation methods. We make a fair comparison of SDS and VSD for text-to-3D generation. For each generation, we show three uniformly sampled views. SDS results like the cottage and pepper mill still suffer from over-saturation problems, while ours and VSD can produce realistic details, color, and texture.
  • ...and 9 more figures