Consistent Flow Distillation for Text-to-3D Generation

Runjie Yan, Yinbo Chen, Xiaolong Wang

TL;DR

Consistent Flow Distillation (CFD) addresses SDS-induced limitations in text-to-3D generation by enforcing cross-view flow consistency during diffusion-based distillation. It reformulates guidance through clean-flow variables derived from PF-ODE/SDE and introduces a multi-view Noise Transport Equation to align noise textures across camera views on the object surface, enabling gradient-based optimization of differentiable 3D representations. CFD supports high-quality, diverse 3D outputs with negligible extra cost relative to SDS and applies across NeRF, 3D Gaussian Splatting, and mesh paradigms using various diffusion teachers. Empirically, CFD outperforms prior score-distillation methods on standard quality and alignment metrics, while ablations validate the importance of its noise design and flow-consistency mechanism for robust 3D synthesis from text prompts.

Abstract

Score Distillation Sampling (SDS) has made significant strides in distilling image-generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From the gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments demonstrate that CFD, through consistent flows, significantly outperforms previous methods in text-to-3D generation.
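
For context, the SDS gradient that the abstract contrasts against is usually written as below. This uses the standard DreamFusion notation rather than anything recovered from this extract: $g_{\theta}(c)$ is the image rendered from camera $c$, $\epsilon_{\phi}$ is the frozen diffusion model's noise prediction, $y$ is the text prompt, and $w(t)$, $\alpha_t$, $\sigma_t$ are the weighting and noise-schedule coefficients.

$$
\nabla_{\theta}\mathcal{L}_{\text{SDS}}
= \mathbb{E}_{t,\boldsymbol{\epsilon},c}\left[ w(t)\,\bigl(\epsilon_{\phi}(\boldsymbol{x}_t; y, t) - \boldsymbol{\epsilon}\bigr)\,\frac{\partial g_{\theta}(c)}{\partial \theta} \right],
\qquad
\boldsymbol{x}_t = \alpha_t\, g_{\theta}(c) + \sigma_t\, \boldsymbol{\epsilon},
\quad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}).
$$

In this form, $\boldsymbol{\epsilon}$ and $t$ are resampled independently at every step and for every view. Going by the abstract and the Figure 2 caption, CFD instead draws the noise from a multi-view consistent field on the object and decreases the timestep over the course of optimization; the exact CFD update is not reproduced here.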

Paper Structure (51 sections, 3 theorems, 39 equations, 18 figures, 8 tables, 3 algorithms)

Key Result

Proposition 1

In Eq. app:eq:clean-flow-sde, if we define a new variable $\boldsymbol{x}_{\pm}'$ according to […], then $\boldsymbol{x}_{\pm}'$ and $\boldsymbol{x}_{\pm}$ in Eq. app:eq:diffusion-sde have the same law (probability distribution) for all $t \in [t_s, T]$; that is, Eq. app:eq:clean-flow-sde and Eq. app:eq:diffusion-sde are equivalent.

Figures (18)

  • Figure 1: Text-to-3D samples of CFD. CFD can generate diverse 3D samples by distilling text-to-image diffusion models. See the videos on our project page for additional generation results.
  • Figure 2: Overview of CFD. The 3D representation $\theta$ is generated with decreasing timesteps. At each timestep $t$, different views $g_{\theta}(c)$ are rendered. The 2D image clean flow provides the gradient at timestep $t$ to the views, and this gradient is backpropagated to $\theta$. The right side shows the gradient computation in detail: we add multi-view consistent noise (see Fig. 3) to the rendered image and pass it into the frozen text-to-image diffusion model; the gradient is computed from the model prediction and then backpropagated to $\theta$.
  • Figure 3: Warping consistent noise for query views. To obtain a query-view noise map, the vertices of each pixel are projected onto the object surface and then warped to coordinates in a high-resolution noise map. The values within the region that these coordinates enclose on the high-resolution noise map are summed and normalized to give the pixel value in the query-view noise.
  • Figure 4: Visual comparison to baseline methods. We compare rendered images from our method with baselines including DreamFusion [poole2022dreamfusion], ProlificDreamer [wang2024prolificdreamer], HiFA [zhu2023hifa], and LucidDreamer [liang2023luciddreamer]. The baseline images are from their official implementations. Prompts: "A 3D model of an adorable cottage with a thatched roof" (top) and "A DSLR photo of an ice cream sundae" (bottom).
  • Figure 5: Ablation on the noise design and the flow space. (a) Directly training $\theta$ with the original PF-ODE on the noisy variable (Eq. eq:noisy-guide-color). (b) Distilling with a bilinearly interpolated noise map. (c) Distilling with random noise. (d) Distilling with our multi-view consistent Gaussian noise, which gives the best visual quality.
  • ...and 13 more figures
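
The aggregation step described in the Figure 3 caption can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: `pixel_noise`, `query_noise`, and the axis-aligned integer `boxes` are hypothetical stand-ins for the projected pixel footprints, and dividing by the square root of the number of summed samples is an assumption, chosen so that each output pixel remains unit-variance Gaussian.

```python
import numpy as np


def pixel_noise(hires, box):
    """Sum the high-resolution noise inside one pixel's footprint and
    rescale so the result keeps unit variance.

    box = (row0, row1, col0, col1), half-open. It is a hypothetical
    stand-in for the region obtained by projecting the pixel's corners.
    """
    r0, r1, c0, c1 = box
    patch = hires[r0:r1, c0:c1]
    # A sum of n i.i.d. N(0, 1) samples has variance n, so dividing by
    # sqrt(n) restores N(0, 1). This normalization is an assumption.
    return patch.sum() / np.sqrt(patch.size)


def query_noise(hires, boxes):
    """Build a query-view noise map from per-pixel footprints."""
    return np.array([[pixel_noise(hires, b) for b in row] for row in boxes])


rng = np.random.default_rng(0)
hires = rng.standard_normal((512, 512))  # shared high-resolution noise map

# Toy footprints: each of the 64x64 query pixels covers one 8x8 block.
k = 8
boxes = [[(i * k, (i + 1) * k, j * k, (j + 1) * k) for j in range(64)]
         for i in range(64)]

noise = query_noise(hires, boxes)

# For comparison: plain block averaging of the same map, which shrinks
# the variance instead of preserving it.
averaged = hires.reshape(64, k, 64, k).mean(axis=(1, 3))

print(noise.shape, noise.std(), averaged.std())
```

Because every value is read from one shared high-resolution map, two views whose pixels cover the same region receive the same noise value. Plain averaging, by contrast, shrinks the standard deviation (to roughly 1/8 in this toy setup), so its output is no longer standard Gaussian noise.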

Theorems & Definitions (3)

  • Proposition 1: Clean flow SDE is equivalent to diffusion SDE
  • Lemma 1: Sample predictions are non-noisy images
  • Proposition 2: $\hat{\boldsymbol{x}}^{\text{c}}_{\pm}$ are non-noisy images