DreamFlow: High-Quality Text-to-3D Generation by Approximating Probability Flow

Kyungmin Lee, Kihyuk Sohn, Jinwoo Shin

TL;DR

This paper proposes to enhance the text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule, and designs DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization framework that enables fast generation of high-quality and high-resolution 3D contents.

Abstract

Recent progress in text-to-3D generation has been achieved through the utilization of score distillation methods: they make use of the pre-trained text-to-image (T2I) diffusion models by distilling via the diffusion model training objective. However, such an approach inevitably results in the use of random timesteps at each update, which increases the variance of the gradient and ultimately prolongs the optimization process. In this paper, we propose to enhance the text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule. To this end, we interpret text-to-3D optimization as a multi-view image-to-image translation problem, and propose a solution by approximating the probability flow. By leveraging the proposed novel optimization algorithm, we design DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization framework that enables fast generation of high-quality and high-resolution (i.e., 1024x1024) 3D contents. For example, we demonstrate that DreamFlow is 5 times faster than the existing state-of-the-art text-to-3D method, while producing more photorealistic 3D contents. Visit our project page (https://kyungmnlee.github.io/dreamflow.github.io/) for visualizations.
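
To make the contrast in the abstract concrete, the following is a minimal sketch of the two ways a timestep can be chosen per update: drawn at random (as in score distillation) or read from a predetermined schedule (as in generative sampling). The function names, the bounds `t_min` and `t_max`, and the linear spacing are assumptions made for illustration; the abstract does not specify the schedule DreamFlow actually uses.

```python
import random


def random_timesteps(num_updates, t_min=0.02, t_max=0.98, seed=0):
    """Score-distillation style: an independent random timestep per update."""
    rng = random.Random(seed)
    return [rng.uniform(t_min, t_max) for _ in range(num_updates)]


def predetermined_timesteps(num_updates, t_min=0.02, t_max=0.98):
    """Sampling style: a fixed schedule that decreases from t_max to t_min.

    Linear spacing is an assumption for illustration only.
    """
    if num_updates == 1:
        return [t_max]
    step = (t_max - t_min) / (num_updates - 1)
    return [t_max - i * step for i in range(num_updates)]


random_ts = random_timesteps(8)
fixed_ts = predetermined_timesteps(8)

print("random:       ", [round(t, 2) for t in random_ts])
print("predetermined:", [round(t, 2) for t in fixed_ts])
```

With the random variant, consecutive updates may jump between very different noise levels, which is the source of gradient variance the abstract describes; with the predetermined variant, every run visits the same noise levels in the same order.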

Paper Structure

This paper contains 47 sections, 11 equations, 17 figures, 17 tables, 1 algorithm.

Figures (17)

  • Figure 1: Examples of 3D scenes generated by DreamFlow. DreamFlow can generate photorealistic 3D models from text prompts within a reasonable generation time (e.g., less than 2 hours), which is made possible by an elucidated optimization strategy using generative diffusion priors.
  • Figure 2: Proposed 3D optimization method APFO. APFO uses a predetermined timestep schedule for efficient 3D optimization. At each timestep $t_i$, we sample $\ell_i$ multi-view images from a 3D scene and update the 3D representation by the approximation of the probability flow computed by Eq. \ref{eq:final}.
  • Figure 3: Coarse-to-fine text-to-3D optimization framework of DreamFlow. Our text-to-3D generation is done in a coarse-to-fine manner: we first optimize a NeRF, then extract a 3D mesh and fine-tune it. We use the same latent diffusion model (denoiser 1) for the first and second stages. Lastly, we refine the 3D mesh with a high-resolution latent diffusion prior (denoiser 2). At each stage, we optimize with a different timestep schedule, which effectively utilizes the diffusion priors.
  • Figure 4: Qualitative comparison with baseline methods. For each baseline method, we present visual examples generated from the same text prompt. Our approach produces more detailed textures.
  • Figure 5: Ablation on the effect of coarse-to-fine text-to-3D optimization. Given that stage 1 generates a high-quality NeRF, stage 2 improves the geometry and texture, and stage 3 refines the 3D mesh to add more photorealistic details.
  • ...and 12 more figures
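
The loop structure described in the Figure 2 caption can be sketched as follows: walk through a predetermined schedule of timesteps $t_i$, and at each one sample $\ell_i$ views and update the scene. This is only a skeleton under stated assumptions. `optimize_with_schedule`, `toy_render`, and `toy_update` are hypothetical names, the scene is a single float rather than a 3D representation, one update per sampled view is an assumption, and `toy_update` is a placeholder that does not implement the paper's probability flow approximation.

```python
def optimize_with_schedule(scene, schedule, render_view, update):
    """Visit (t_i, l_i) pairs in order; at each t_i, render l_i views and update.

    One update per rendered view is an assumption made for this sketch.
    """
    log = []
    for t_i, l_i in schedule:
        for view_index in range(l_i):
            image = render_view(scene, view_index)
            scene = update(scene, image, t_i)
            log.append((t_i, view_index))
    return scene, log


def toy_render(scene, view_index):
    # Placeholder: a real renderer would depend on the sampled camera pose.
    return scene


def toy_update(scene, image, t):
    # Placeholder update: move toward a fixed target, with a larger step
    # at larger t. This stands in for the actual update rule.
    target = 1.0
    return scene + 0.5 * t * (target - image)


# Hypothetical schedule: timesteps decrease, with l_i views at each timestep.
schedule = [(0.9, 3), (0.5, 2), (0.1, 1)]
final_scene, log = optimize_with_schedule(0.0, schedule, toy_render, toy_update)

print("visited timesteps:", [t for t, _ in log])
print("final toy scene value:", round(final_scene, 4))
```

Passing the renderer and the update rule as arguments keeps the schedule logic separate from the model-specific parts, so a different schedule could be supplied per stage without touching the loop.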