Score Distillation via Reparametrized DDIM

Artem Lukoianov; Haitz Sáez de Ocáriz Borde; Kristjan Greenewald; Vitor Campagnolo Guizilini; Timur Bagautdinov; Vincent Sitzmann; Justin Solomon

TL;DR

This paper identifies why Score Distillation Sampling (SDS) struggles to produce high-fidelity 3D shapes: SDS guides 3D renders with a high-variance, i.i.d. noise term that is inconsistent with the DDIM denoising trajectory. By reparameterizing the diffusion process in terms of the single-step denoised image $x_0(t)$ and viewing SDS as a DDIM-like velocity field, the authors derive Score Distillation via Inversion (SDI), which replaces the random noise with noise obtained by conditional DDIM inversion of $x_0(t)$ at noise level $t$ under the conditioning $y$. In 2D, SDI nearly matches the quality of the underlying diffusion sampler; in 3D, it substantially improves geometry and texture, achieving results comparable or superior to state-of-the-art Score Distillation methods without additional training or multi-view supervision. The approach offers a theoretical bridge between 2D diffusion sampling and 3D asset generation, reducing over-saturation and preserving high-frequency detail, with practical implications for single-view training pipelines and broader diffusion-based 3D synthesis. Limitations include 3D view consistency challenges and potential diffusion-model biases, suggesting future work in depth/normal supervision and multi-view conditioning.

Abstract

While 2D diffusion models generate realistic, high-detail images, 3D shape generation methods like Score Distillation Sampling (SDS) built on these 2D diffusion models produce cartoon-like, over-smoothed shapes. To help explain this discrepancy, we show that the image guidance used in Score Distillation can be understood as the velocity field of a 2D denoising generative process, up to the choice of a noise term. In particular, after a change of variables, SDS resembles a high-variance version of Denoising Diffusion Implicit Models (DDIM) with a differently sampled noise term: SDS draws the noise i.i.d. at each step, while DDIM infers it from the previous noise predictions. This excessive variance can lead to over-smoothing and unrealistic outputs. We show that a better noise approximation can be recovered by inverting DDIM in each SDS update step. This modification makes SDS's generative process for 2D images almost identical to DDIM. In 3D, it removes over-smoothing, preserves higher-frequency detail, and brings the generation quality closer to that of 2D samplers. Experimentally, our method achieves better or similar 3D generation quality compared to other state-of-the-art Score Distillation methods, all without training additional neural networks or multi-view supervision, while providing useful insights into the relationship between 2D and 3D asset generation with diffusion models.
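The contrast between the two noise choices can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the authors' code: it uses a 1D Gaussian data distribution (hypothetical constants `MU` and `S`) whose optimal noise predictor is known in closed form, a variance-exploding parameterization $x(t) = x_0 + \sigma(t)\,\epsilon$, and hypothetical helper names `eps_model`, `ddim_sample`, and `x0_process`. It writes deterministic DDIM as a process on the single-step denoised variable $x_0(t)$ and lets the noise term either be carried over from the previous prediction (DDIM-consistent) or be redrawn i.i.d. at every step (SDS-like).

```python
import numpy as np

MU, S = 2.0, 0.5  # hypothetical 1D data distribution N(MU, S^2)


def eps_model(x, sigma):
    """Closed-form noise predictor for Gaussian data under x = x0 + sigma * eps."""
    return sigma * (x - MU) / (S**2 + sigma**2)


def ddim_sample(x_T, sigmas):
    """Deterministic DDIM on the noisy variable x(t), noise levels decreasing."""
    x = x_T
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s_cur) * eps_model(x, s_cur)
    return x


def x0_process(x_T, sigmas, rng=None):
    """The same update written on the single-step denoised variable x0(t).

    rng=None  : reuse the previous noise prediction (DDIM-consistent noise).
    rng given : redraw i.i.d. Gaussian noise at every step (SDS-like noise).
    """
    kappa = eps_model(x_T, sigmas[0])      # noise predicted at the first level
    x0 = x_T - sigmas[0] * kappa           # single-step denoised estimate
    for s_next in sigmas[1:]:
        if rng is not None:
            kappa = rng.standard_normal(np.shape(x0))
        pred = eps_model(x0 + s_next * kappa, s_next)
        x0 = x0 - s_next * (pred - kappa)  # update on x0(t)
        if rng is None:
            kappa = pred                   # carry the prediction forward
    return x0
```

With `rng=None` and a schedule that ends at zero noise, `x0_process` agrees with `ddim_sample` up to floating-point error; passing a generator changes only how the noise term is chosen, which is the single difference the abstract attributes to SDS. The Gaussian stand-in exercises the algebra only and says nothing about image or 3D quality.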

Paper Structure (29 sections, 14 equations, 30 figures, 2 tables, 2 algorithms)

Introduction
Related work
Background
Linking SDS to DDIM
Score Distillation via Inversion (SDI)
Experiments
3D generation
Ablations
Conclusion, Limitations, and Future Work
Implementation details
Timesteps
Geometry regularization
System details
Prompts used in the quantitative evaluation
Comparison with Interval Score Matching
...and 14 more sections

Figures (30)

Figure 1: Score Distillation Sampling (SDS) "distills" 3D shapes from 2D image generative models like DDIM. While DDIM produces high-quality images (a), the same diffusion model yields blurry results with SDS in the task of 2D image generation (b); in 3D, SDS yields over-saturated and simplified shapes (d). By replacing the noise term in SDS to agree with DDIM, our algorithm better matches the quality of the diffusion model in 2D (c) and significantly improves 3D generation (e).
Figure 2: Examples of 3D objects generated with our method.
Figure 3: The effect of classifier-free guidance (CFG) values on 2D generation with Stable Diffusion 2.1 (Rombach et al., CVPR 2022). For small values, the model tends to ignore certain words in the prompt. For high values, images become over-saturated.
Figure 4: Left: Evolution of variables in Score Distillation with time. The top row depicts how noisy images $x(t)$ evolve during 2D generation; the middle row shows evolution of a NeRF for 3D generation; and the bottom row shows how the single step denoised variable $x_0(t)$ changes with $t$. Right: Each step of DDIM steps toward a denoised image. This can be seen as a step to $x_0(t)$ and a step back to a slightly less noisy image. Through a change-of-variables we obtain a process on $x_0(t)$.
Figure 5: Overview of SDI. At each training iteration, SDI renders a random view of the 3D shape, runs DDIM inversion up to the noise level $t$, and denoises the image with a pre-trained diffusion model for noise level $t - \tau$. Finally, the denoised image is back-propagated into the 3D shape.
...and 25 more figures
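The iteration described in the Figure 5 caption (invert to noise level $t$, then denoise at level $t - \tau$) can be sketched in 2D, where the "render" is just the image itself and no 3D back-propagation is needed. This is a minimal sketch under stated assumptions, not the authors' implementation: `eps_model` is a closed-form stand-in for the pre-trained diffusion model on a hypothetical 1D Gaussian (`MU`, `S`), the parameterization is variance-exploding ($x(t) = x_0 + \sigma(t)\,\epsilon$), and the way the noise is read off the inverted sample (`kappa` below) is one plausible choice. The names `ddim_invert` and `sdi_step` are illustrative.

```python
import numpy as np

MU, S = 2.0, 0.5  # hypothetical 1D "image" distribution N(MU, S^2)


def eps_model(x, sigma):
    """Stand-in for the pre-trained noise predictor (closed form for Gaussian data)."""
    return sigma * (x - MU) / (S**2 + sigma**2)


def ddim_invert(x0, sigma_t, n_steps=10):
    """Run the deterministic DDIM update in reverse, from noise level 0 up to sigma_t."""
    levels = np.linspace(0.0, sigma_t, n_steps + 1)
    x = np.array(x0, dtype=float)
    for s_cur, s_next in zip(levels[:-1], levels[1:]):
        x = x + (s_next - s_cur) * eps_model(x, s_cur)
    return x


def sdi_step(x0, sigma_t, sigma_prev):
    """One update: invert to level t, infer the noise, denoise at level t - tau."""
    x_t = ddim_invert(x0, sigma_t)
    kappa = (x_t - x0) / sigma_t        # noise implied by the inversion (assumption)
    x_noisy = x0 + sigma_prev * kappa   # re-noise the current image at level t - tau
    # Single-step denoised image; in 3D this would be the regression target
    # that is back-propagated into the shape parameters.
    return x_noisy - sigma_prev * eps_model(x_noisy, sigma_prev)


# Usage: a few iterations over a decreasing noise schedule.
x0 = np.array([0.0, 1.0, 3.5])
schedule = np.linspace(5.0, 0.0, 6)
for s_t, s_prev in zip(schedule[:-1], schedule[1:]):
    x0 = sdi_step(x0, s_t, s_prev)
```

In a 3D setting the returned image would not overwrite `x0` directly; it would instead serve as a target for a rendered view, with gradients flowing into the shape representation. The Gaussian stand-in only exercises the control flow and should not be read as evidence about generation quality.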
