Flow Score Distillation for Diverse Text-to-3D Generation

Runjie Yan; Kailu Wu; Kaisheng Ma

TL;DR

Flow Score Distillation (FSD) tackles the diversity limitation of Score Distillation Sampling (SDS) in text-to-3D generation by linking SDS to the DDIM PF-ODE formulation and replacing stochastic noise with a deterministic, view-coherent noise strategy via a world-map noise function $\boldsymbol{\epsilon}(\boldsymbol{c})$. The method reframes the PF-ODE as an SDS-like loss and uses a monotone timestep schedule to align with DDIM, which improves generation diversity without sacrificing quality. FSD is lifted to 3D by designing a view-dependent noise and a stable 3D rendering loss $L_{FSD}$, with a world-map noise design that avoids geometry holes seen with naive noise. Experiments on Stable Diffusion and MVDream backbones show substantial diversity gains and robust quality, demonstrating that flow-based diffusion priors can be effectively applied to text-to-3D generation.

Abstract

Recent advancements in text-to-3D generation have yielded remarkable progress, particularly through methods that rely on Score Distillation Sampling (SDS). While SDS can create impressive 3D assets, it is hindered by its inherently maximum-likelihood-seeking nature, which limits the diversity of generation outcomes. In this paper, we discover that the Denoising Diffusion Implicit Models (DDIM) generation process (i.e., the PF-ODE) can be succinctly expressed using an analogue of the SDS loss. Going one step further, one can see SDS as a generalized DDIM generation process. Following this insight, we show that the noise sampling strategy in the noise addition stage significantly restricts the diversity of generation results. To address this limitation, we present an innovative noise sampling approach and introduce a novel text-to-3D method called Flow Score Distillation (FSD). Our validation experiments across various text-to-image Diffusion Models demonstrate that FSD substantially enhances generation diversity without compromising quality.
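The effect of the noise sampling strategy can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes a 1-D Gaussian "data distribution" $N(\texttt{MU}, \texttt{S}^2)$, for which the optimal noise prediction has a closed form, a cosine noise schedule, uniform timestep sampling for the SDS-style loop, and unit weighting. All function names and hyperparameters are hypothetical.

```python
import math
import random
import statistics

MU, S = 2.0, 0.5  # toy 1-D data distribution N(MU, S^2); purely illustrative


def alpha_sigma(t):
    # Assumed variance-preserving schedule: alpha_t = cos(pi t / 2), sigma_t = sin(pi t / 2).
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def eps_pred(x_t, t):
    # Exact E[eps | x_t] for Gaussian data; stands in for a pretrained noise predictor.
    a, s = alpha_sigma(t)
    return s * (x_t - a * MU) / (a * a * S * S + s * s)


def fsd_sample(eps_tilde, n_steps=500, t_max=0.98):
    # FSD-style run: one fixed noise eps_tilde, monotonically decreasing timesteps.
    theta = 0.0
    for i in range(n_steps):
        t_cur = t_max * (1 - i / n_steps)
        t_next = t_max * (1 - (i + 1) / n_steps)
        a, s = alpha_sigma(t_cur)
        a_next, s_next = alpha_sigma(t_next)
        x_t = a * theta + s * eps_tilde
        # Euler step in lambda = sigma / alpha; d_lam < 0 because t decreases.
        d_lam = s_next / a_next - s / a
        theta += d_lam * (eps_pred(x_t, t_cur) - eps_tilde)
    return theta


def sds_sample(rng, n_steps=1000, lr=0.02, t_min=0.02, t_max=0.98):
    # SDS-style run: fresh Gaussian noise and a random timestep at every step.
    theta = 0.0
    for _ in range(n_steps):
        t = rng.uniform(t_min, t_max)
        eps = rng.gauss(0.0, 1.0)
        a, s = alpha_sigma(t)
        theta -= lr * (eps_pred(a * theta + s * eps, t) - eps)
    return theta


rng = random.Random(0)
noises = [rng.gauss(0.0, 1.0) for _ in range(200)]
fsd = [fsd_sample(e) for e in noises]
sds = [sds_sample(rng) for _ in range(200)]

print("FSD-style spread:", statistics.pstdev(fsd))
print("SDS-style spread:", statistics.pstdev(sds))
```

In this toy setting each fixed-noise run lands near `MU + S * eps_tilde`, so the spread of the outputs is close to `S`, while the resampled-noise runs cluster near the mode `MU`.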

Paper Structure (53 sections, 1 theorem, 27 equations, 18 figures, 3 tables, 1 algorithm)

Introduction
Preliminaries and Related Works
Diffusion Models
Diffusion PF-ODE and DDIM
Score Distillation Sampling
Flow Score Distillation for 2D Generation
Simplified Formulation of Diffusion PF-ODE
Flow Score Distillation on 2D
Analysis of the Noise Sampling Strategy
Compare FSD with SDS and DDIM on 2D
Noise Sampling Strategy
Optimizer
Diffusion Timestep Schedule
Lifting Flow Score Distillation to 3D
Designing $\boldsymbol{\epsilon}(\boldsymbol{c})$.
...and 38 more sections

Key Result

Proposition

Diffusion PF-ODE (eq:pf-ode-simple) can be equivalently formulated by an analogue of the SDS loss (eq:tiny-sds-2d), where $\boldsymbol{x}_t = \alpha_t \hat{\boldsymbol{x}}^{\text{c}}_t + \sigma_t \tilde{\boldsymbol{\epsilon}}$, $\theta = \hat{\boldsymbol{x}}^{\text{c}}_t$, and $w_t' = \frac{\mathrm{d} (\sigma_t/\alpha_t)}{\mathrm{d} t}$ is a weighting scalar.
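
A sketch of why such an equivalence is plausible, assuming the standard DDIM form of the PF-ODE rather than the paper's exact statement. The symbol $\boldsymbol{\epsilon}_\phi$ for the pretrained noise predictor is an assumption here, and text conditioning is omitted. In the rescaled variable $\boldsymbol{x}_t/\alpha_t$, the DDIM ODE reads

$$
\frac{\mathrm{d}}{\mathrm{d}t}\left(\frac{\boldsymbol{x}_t}{\alpha_t}\right) = \frac{\mathrm{d}(\sigma_t/\alpha_t)}{\mathrm{d}t}\,\boldsymbol{\epsilon}_\phi(\boldsymbol{x}_t, t).
$$

Substituting $\boldsymbol{x}_t/\alpha_t = \hat{\boldsymbol{x}}^{\text{c}}_t + (\sigma_t/\alpha_t)\,\tilde{\boldsymbol{\epsilon}}$ with $\tilde{\boldsymbol{\epsilon}}$ held constant in $t$ gives

$$
\frac{\mathrm{d}\hat{\boldsymbol{x}}^{\text{c}}_t}{\mathrm{d}t} = w_t'\left(\boldsymbol{\epsilon}_\phi(\boldsymbol{x}_t, t) - \tilde{\boldsymbol{\epsilon}}\right).
$$

The right-hand side has the same shape as the SDS gradient with respect to $\theta = \hat{\boldsymbol{x}}^{\text{c}}_t$, with the freshly sampled Gaussian noise replaced by the fixed $\tilde{\boldsymbol{\epsilon}}$.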

Figures (18)

Figure 1: Generation results of FSD and the baseline method SDS. FSD uses pretrained text-to-image Diffusion Models to generate realistic 3D models from text prompts. We improve the noise sampling strategy upon SDS and achieve diverse generation results without compromising quality.
Figure 2: Method overview of FSD. We propose Flow Score Distillation for text-to-3D generation by lifting a pretrained Diffusion Model. FSD renders an image $\boldsymbol{g}_\theta(\boldsymbol{c})$ from the 3D representation and adds noise $\boldsymbol{\epsilon}(\boldsymbol{c})$ to the rendered image. To compute parameter updates according to $L_{\text{FSD}}^\theta$, FSD uses a frozen text-to-image Diffusion Model to predict the noise $\boldsymbol{\epsilon}(\boldsymbol{c})$ added to the image $\boldsymbol{g}_\theta(\boldsymbol{c})$. Similar to SDS [poole2022dreamfusion, wang2023score], FSD computes $L_{\text{FSD}}^\theta$ as an image reconstruction loss between the "clean image" $\hat{\boldsymbol{x}}^{\text{c}}_t = \boldsymbol{g}_\theta(\boldsymbol{c})$ and the "ground-truth image" $\hat{\boldsymbol{x}}_0$ predicted by the pretrained Diffusion Model. FSD further adopts a timestep annealing schedule and a noise sampling strategy. Instead of sampling noise from a Gaussian distribution at each optimization step as SDS does, we generate noise according to the deterministic noise function $\boldsymbol{\epsilon}(\boldsymbol{c})$, which is fixed at the beginning of the optimization.
Figure 3: Generation results of different methods in image space with the same random seeds. FSD can generate images that are very similar to images generated by DDIM given the same initial noise (implied by the proposition on the equivalent form of the Diffusion PF-ODE). However, FSD can also be used for 3D generation, a task for which DDIM is not suitable. See experiment details in the Appendix.
Figure 4: Visualization of FSD and SDS for image generation. We visualize generation results and the estimated ground-truth images in consecutive steps halfway through generation for both FSD and SDS [poole2022dreamfusion, song2020score]. We find that the estimated ground-truth images of FSD are consistent, while the contents of the ground-truth images of SDS vary greatly. We set CFG=7.5 and adopt the same linear timestep annealing schedule for both FSD and SDS in the experiment for this figure.
Figure 5: Impact of initial noise $\tilde{\boldsymbol{\epsilon}}$. Experiments show that the local textures of the noise added during FSD optimization are highly correlated with the textures of the final image. We shuffle the patches of the initial noise $\tilde{\boldsymbol{\epsilon}}$ used by FSD and observe that the textures of the generated images are shuffled in the same way. This property inspired our design of the world-map noise function $\boldsymbol{\epsilon}(\boldsymbol{c})$ for 3D generation in this work. In this figure, the parts framed by dotted lines of the same color share the same initial noise $\tilde{\boldsymbol{\epsilon}}$ patches.
...and 13 more figures
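
The patch-shuffling manipulation described in the Figure 5 caption can be sketched as follows. This is a minimal illustration on a plain 2-D grid of Gaussian values. The function name `shuffle_patches`, the grid size, and the patch size are hypothetical, and the generation step that would consume the shuffled noise is not shown.

```python
import random


def shuffle_patches(noise, patch, rng):
    """Permute the non-overlapping patch x patch tiles of a 2-D noise grid.

    Returns the shuffled grid and the permutation, where destination tile
    `dst` receives source tile `perm[dst]` (tiles indexed in row-major order).
    """
    h, w = len(noise), len(noise[0])
    if h % patch or w % patch:
        raise ValueError("grid size must be a multiple of the patch size")
    gh, gw = h // patch, w // patch
    perm = list(range(gh * gw))
    rng.shuffle(perm)
    out = [[0.0] * w for _ in range(h)]
    for dst, src in enumerate(perm):
        dr, dc = divmod(dst, gw)  # destination tile row / column
        sr, sc = divmod(src, gw)  # source tile row / column
        for i in range(patch):
            for j in range(patch):
                out[dr * patch + i][dc * patch + j] = noise[sr * patch + i][sc * patch + j]
    return out, perm


rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
shuffled, perm = shuffle_patches(noise, 4, rng)
```

Because the tiles are only moved and never resampled, the shuffled grid contains exactly the same values as the original, which is what lets the textures of the output be traced back to individual noise patches.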

Theorems & Definitions (1)

Proposition: An equivalent form of Diffusion PF-ODE
