FlowDreamer: Exploring High Fidelity Text-to-3D Generation via Rectified Flow

Hangyu Li; Xiangxiang Chu; Dingyuan Shi; Wang Lin

TL;DR

This paper addresses the over-smoothing and color-saturation issues in SDS-based text-to-3D generation by replacing the diffusion prior with a pretrained rectified flow model. It first formulates Vector Field Distillation Sampling (VFDS) to adapt SDS to rectified flow, then traces the remaining over-smoothing to its root causes through an analysis of ODE trajectories. Building on this, FlowDreamer introduces a Unique Couple Matching (UCM) loss: a push-backward noise search, grounded in the reversibility and coupling properties of rectified flow, finds the noise that corresponds to each rendered image and so constrains learning to a single trajectory. Empirically, FlowDreamer achieves higher fidelity and richer textual details with faster convergence on both NeRF and 3D Gaussian Splatting, outperforming prior SDS- and diffusion-based methods. The work also leaves open questions around initialization for NeRF and noise-search strategies. The approach offers a faster, higher-quality alternative to diffusion priors for text-to-3D generation that applies to multiple 3D representations.

Abstract

Recent advances in text-to-3D generation have made significant progress. In particular, with pretrained diffusion models, existing methods predominantly use Score Distillation Sampling (SDS) to train 3D models such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS). However, they often suffer from over-smoothed textures and over-saturated colors. The rectified flow model, which uses a simple ordinary differential equation (ODE) to represent a straight trajectory, shows promise as an alternative prior for text-to-3D generation. It learns a time-independent vector field, thereby reducing the ambiguity in 3D model update gradients that are calculated using time-dependent scores in the SDS framework. In light of this, we first develop a mathematical analysis to integrate SDS with the rectified flow model, paving the way for our initial framework, Vector Field Distillation Sampling (VFDS). However, empirical findings indicate that VFDS still produces over-smoothed results. We therefore analyze the underlying reasons for this failure from the perspective of ODE trajectories. Building on this analysis, we propose a novel framework, named FlowDreamer, which yields high-fidelity results with richer textual details and faster convergence. The key insight is to leverage the coupling and reversible properties of the rectified flow model to search for the corresponding noise, rather than using randomly sampled noise as in VFDS. Accordingly, we introduce a novel Unique Couple Matching (UCM) loss, which guides the 3D model to optimize along the same trajectory.
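The difference between randomly sampled noise and searched noise can be illustrated with a toy sketch. It is not the paper's implementation: the constant field `toy_velocity` is a hypothetical stand-in for the pretrained vector field, the functions `push_backward` and `matching_residual` are illustrative names, and the conventions used (data at t = 0, noise at t = 1, linear interpolation x_t = (1 - t) x_0 + t ε, velocity target ε - x_0) are assumptions, since the exact form of the UCM loss is not given in this summary.

```python
import numpy as np

def toy_velocity(x, t):
    """Hypothetical stand-in for a pretrained vector field: a constant drift."""
    return np.full_like(x, 0.5)

def push_backward(x0, velocity, steps=10):
    """Euler-integrate the ODE from data (t=0) to noise (t=1).

    Assumed convention: following the field forward in t carries x0 to the
    noise it is coupled with.
    """
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

def matching_residual(x0, eps, t, velocity):
    """Field prediction at x_t minus the straight-line direction (eps - x0)."""
    x_t = (1.0 - t) * x0 + t * eps  # linear interpolation between data and noise
    return velocity(x_t, t) - (eps - x0)

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)                 # stands in for an encoded rendered view
eps_coupled = push_backward(x0, toy_velocity)
eps_random = rng.normal(size=4)         # VFDS-style random noise

res_coupled = matching_residual(x0, eps_coupled, 0.3, toy_velocity)
res_random = matching_residual(x0, eps_random, 0.3, toy_velocity)
print(np.linalg.norm(res_coupled), np.linalg.norm(res_random))
```

In this toy setting the searched noise lies on the same trajectory as x0, so the residual vanishes at any t, whereas a random noise pairs x0 with a trajectory the field never produced and leaves a non-zero residual.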

Paper Structure (17 sections, 17 equations, 18 figures, 4 tables)

Introduction
Related Works
FlowDreamer
VFDS: SDS in the Lens of Rectified flow
Unique Couple Matching Loss
FlowDreamer for NeRF and 3D GS
Experiments
3D Generation Settings
Quantitative Comparisons
Experimental Insights of our FlowDreamer
Conclusion
Some illustrations
Implementation Details
User study
VF-ISM Derivation
...and 2 more sections

Figures (18)

Figure 1: FlowDreamer uses a pretrained rectified flow model to generate high-fidelity results from text prompts. It can generate not only highly realistic objects, such as guns and shoes, but also fantastical ones, such as dragon heads.
Figure 2: An example of over-smoothing results.
Figure 3: Illustration of our FlowDreamer. Images of random views from different camera poses are sampled and then input to the VAE encoder to obtain the latents. We replace the randomly sampled noise $\epsilon$ in VFDS with $\#_\phi[x]$ via the push-backward process. Next, we sample $t$ from $U[0, 1]$ and interpolate to obtain $x_t$. Finally, the UCM loss with the conditional prompt is applied to update the 3D model.
Figure 4: (a): An illustration of the reversible and coupling properties of the rectified flow model. The reversible property indicates that $\epsilon$ can map to $x_0$, and $x_0$ can map to $\epsilon$ by reversing the direction of $v_\phi$. The coupling property indicates that $\epsilon$ and $x_0$ can only form a unique coupling. For example, $\epsilon_2$ and $x_0^2$ form a coupling $(\epsilon_2, x_0^2)$; therefore, $\epsilon_2$ and $x_0^1$ can't form a coupling $(\epsilon_2, x_0^1)$ again. (b): An illustration of the trajectories of diffusion and rectified flow. The gradient direction of the diffusion trajectory varies with different $t$, while the rectified flow roughly remains the same for different $t$ under ideal circumstances.
Figure 5: Illustration for over-smoothing analysis. An image is coupled with multiple randomly sampled noises, causing the 3D model to learn ODE trajectories.
...and 13 more figures
