Stable Score Distillation for High-Quality 3D Generation

Boshi Tang, Jianan Wang, Zhiyong Wu, Lei Zhang

TL;DR

This work provides a theoretical framework for SDS by decomposing its estimator into mode-disengaging, mode-seeking, and variance-reducing components, and uses that decomposition to identify the root causes of over-smoothing and implausibility. It introduces Stable Score Distillation (SSD), a simple, timestep-aware estimator that combines these terms with adaptive variance reduction to improve 3D content quality while remaining compatible with existing diffusion-based frameworks. The authors validate SSD through numerical simulations and text-to-3D experiments, showing better alignment with prompts, crisper geometry, and richer color, along with extensive ablations and proofs of key properties. The findings offer practical guidance for 3D generation workflows and establish a principled connection between optimization practices and diffusion-based 3D synthesis outcomes.

Abstract

Although Score Distillation Sampling (SDS) has exhibited remarkable performance in conditional 3D content generation, a comprehensive understanding of its formulation is still lacking, hindering the development of 3D generation. In this work, we decompose SDS as a combination of three functional components, namely mode-seeking, mode-disengaging and variance-reducing terms, analyzing the properties of each. We show that problems such as over-smoothness and implausibility result from the intrinsic deficiency of the first two terms and propose a more advanced variance-reducing term than that introduced by SDS. Based on the analysis, we propose a simple yet effective approach named Stable Score Distillation (SSD), which strategically orchestrates each term for high-quality 3D generation and can be readily incorporated into various 3D generation frameworks and 3D representations. Extensive experiments validate the efficacy of our approach, demonstrating its ability to generate high-fidelity 3D content without succumbing to issues such as over-smoothness.

Paper Structure (34 sections, 13 equations, 20 figures, 1 table, 1 algorithm)

Figures (20)

  • Figure 1: 3D Gaussian generation from text prompts.
  • Figure 2: NeRF generation from text prompts.
  • Figure 3: Incorporating SSD into existing 3D generation frameworks consistently improves generation quality.
  • Figure 4: Comparisons between SSD and SOTA methods on text-to-3D generation. Baseline results are obtained from their papers.
  • Figure 5: More comparisons between SSD and SOTA methods on text-to-3D generation with more diverse prompts. For each text prompt, baseline results (obtained from their papers, except for Fantasia3D) are presented on the left, while SSD results are shown on the right.
  • ...and 15 more figures