Consistent Flow Distillation for Text-to-3D Generation
Runjie Yan, Yinbo Chen, Xiaolong Wang
TL;DR
Consistent Flow Distillation (CFD) addresses limitations of Score Distillation Sampling (SDS) in text-to-3D generation by enforcing cross-view flow consistency during diffusion-based distillation. It reformulates guidance through clean-flow variables derived from the probability flow ODE (PF-ODE) or SDE, and introduces a multi-view Noise Transport Equation that aligns noise textures across camera views on the object surface, enabling gradient-based optimization of differentiable 3D representations. CFD produces high-quality, diverse 3D outputs at negligible extra cost relative to SDS and applies to NeRF, 3D Gaussian Splatting, and mesh representations with various diffusion teachers. Empirically, CFD outperforms prior score-distillation methods on standard quality and alignment metrics, and ablations validate the importance of its noise design and flow-consistency mechanism for robust 3D synthesis from text prompts.
Abstract
Score Distillation Sampling (SDS) has made significant strides in distilling image-generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From the gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments demonstrate that CFD, through consistent flows, significantly outperforms previous methods in text-to-3D generation.
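The idea of multi-view consistent noise can be illustrated with a deliberately simplified toy sketch. It is not the paper's implementation: the surface is reduced to a handful of integer point ids, "rendering" is a plain lookup of the points a view sees, and every function name (`make_surface_noise`, `render_noise`, `independent_noise`) is hypothetical. The sketch only shows the property the abstract relies on: when Gaussian noise is attached to the 3D object, two views that see the same surface point receive the same noise value, whereas noise drawn independently per view does not agree.

```python
import random


def make_surface_noise(num_points, seed=0):
    """Draw one standard-Gaussian sample per surface point (toy 'noise texture')."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_points)]


def render_noise(surface_noise, visible_ids):
    """Toy 'render': each visible point contributes the noise attached to it."""
    return {pid: surface_noise[pid] for pid in visible_ids}


def independent_noise(visible_ids, seed):
    """Baseline for comparison: fresh Gaussian noise drawn separately for a view."""
    rng = random.Random(seed)
    return {pid: rng.gauss(0.0, 1.0) for pid in visible_ids}


# Assumed toy setup: 8 surface points, two views that overlap on points 3 and 4.
surface_noise = make_surface_noise(8, seed=0)
view_a = [0, 1, 2, 3, 4]
view_b = [3, 4, 5, 6, 7]
shared = sorted(set(view_a) & set(view_b))

# Noise tied to the surface: both views read the same values on shared points.
consistent_a = render_noise(surface_noise, view_a)
consistent_b = render_noise(surface_noise, view_b)
consistent_match = all(consistent_a[p] == consistent_b[p] for p in shared)

# Noise drawn per view: shared points get unrelated values.
indep_a = independent_noise(view_a, seed=1)
indep_b = independent_noise(view_b, seed=2)
independent_match = all(indep_a[p] == indep_b[p] for p in shared)

print("consistent noise agrees on shared points:", consistent_match)
print("independent noise agrees on shared points:", independent_match)
```

In this toy setup the surface-attached noise agrees on the shared points and the per-view noise does not. A real pipeline would additionally have to keep the rendered noise Gaussian at the pixel level under perspective and resolution changes, which this sketch does not attempt.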
