OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control

Yuzhong Huang, Zhong Li, Zhang Chen, Zhiyuan Ren, Guosheng Lin, Fred Morstatter, Yi Xu

TL;DR

OrientDream is a camera-orientation-conditioned framework for efficient, multi-view-consistent 3D generation from text prompts. It produces high-quality NeRF models with consistent multi-view properties and optimizes significantly faster than existing methods, as quantified by comparative metrics.

Abstract

In the evolving landscape of text-to-3D technology, Dreamfusion has showcased its proficiency by utilizing Score Distillation Sampling (SDS) to optimize implicit representations such as NeRF. This process is achieved through the distillation of pretrained large-scale text-to-image diffusion models. However, Dreamfusion encounters fidelity and efficiency constraints: it faces the multi-head Janus issue and exhibits a relatively slow optimization process. To circumvent these challenges, we introduce OrientDream, a camera orientation conditioned framework designed for efficient and multi-view consistent 3D generation from textual prompts. Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module. This feature effectively utilizes data from MVImgNet, an extensive external multi-view dataset, to refine and bolster its functionality. Subsequently, we utilize the pre-conditioned 2D images as a basis for optimizing a randomly initialized implicit representation (NeRF). This process is significantly expedited by a decoupled back-propagation technique, allowing for multiple updates of implicit parameters per optimization cycle. Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also achieves an optimization speed significantly greater than existing methods, as quantified by comparative metrics.
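
For reference, the Score Distillation Sampling gradient mentioned above can be written as follows. This is the form given in DreamFusion. The extra conditioning variable $c$ (the camera orientation) is an assumption based on the abstract's description, not an equation taken from this page. Here $\theta$ denotes the NeRF parameters, $x = g(\theta)$ the rendered image, $y$ the text prompt, $\hat{\epsilon}_\phi$ the diffusion model's noise prediction, and $w(t)$ a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, c,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \alpha_t\, x + \sigma_t\, \epsilon,
\quad \epsilon \sim \mathcal{N}(0, I)
```

Dropping $c$ recovers the original text-only formulation, which gives the diffusion model no explicit signal about the side from which the object is rendered.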

Paper Structure (16 sections, 6 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: 3D objects generated by the OrientDream pipeline from text input. The pipeline generates 3D objects with high-quality textures that are effectively free from the view inconsistencies commonly referred to as the Janus problem. Geometry and normal maps are shown alongside each result for further detail and clarity.
  • Figure 2: Illustration of the Janus Problem: This figure showcases typical Janus issue manifestations, where the bunny is depicted with three ears, the pig with two noses, the eagle with a pair of heads, and the frog anomalously having three back legs but only one front leg.
  • Figure 3: Overview of the camera orientation conditioned Diffusion Model: This figure illustrates the core components of our model within the OrientDream pipeline. It highlights how we integrate encoded camera orientations with text inputs, utilizing quaternion forms and sine encoding for precise view angle differentiation. The model, fine-tuned on the real-world MVImgNet dataset, demonstrates our approach to enhancing 3D consistency and accuracy in NeRF generation, surpassing common limitations found in models trained solely on synthetic 3D datasets.
  • Figure 4: Overview of Text-to-3D Generation Methodologies: This figure illustrates our dual approach to applying multi-view diffusion models for 3D generation. It highlights the use of multi-viewpoint images for sparse 3D reconstruction, alongside our focus on employing an orientation-conditioned diffusion model for Score Distillation Sampling (SDS). The figure showcases our SDS adaptation, in which traditional models are replaced with our orientation-conditioned diffusion model, integrating camera parameters and text prompts for enhanced 3D content generation.
  • Figure 5: Decoupled Sampling: This figure summarizes our approach to enhancing NeRF optimization speed, highlighting the shift from uniform sampling to a targeted reduction in $T$ steps and the use of the DDIM solver for efficient computation, thereby improving both the diversity and texture quality of generated 3D models.
  • ...and 2 more figures
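
The Figure 3 caption mentions representing camera orientation in quaternion form and applying a sine encoding. The sketch below shows one way such a conditioning vector could be built. It is illustrative only: the function names, the azimuth/elevation axis convention, and the power-of-two frequency schedule are all assumptions, not details taken from the paper.

```python
import math


def axis_quaternion(axis, angle_rad):
    """Unit quaternion (w, x, y, z) for a rotation of angle_rad about a unit axis."""
    half = angle_rad / 2.0
    s = math.sin(half)
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)


def quat_multiply(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def camera_quaternion(azimuth_deg, elevation_deg):
    """Hypothetical convention: azimuth about the z-axis, elevation about the y-axis."""
    q_az = axis_quaternion((0.0, 0.0, 1.0), math.radians(azimuth_deg))
    q_el = axis_quaternion((0.0, 1.0, 0.0), math.radians(elevation_deg))
    return quat_multiply(q_az, q_el)


def sine_encode(values, num_freqs=4):
    """Sinusoidal features: sin and cos of each value at frequencies 2^k * pi (assumed schedule)."""
    feats = []
    for v in values:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * v
            feats.append(math.sin(angle))
            feats.append(math.cos(angle))
    return feats


# Example: encode a side view at 15 degrees of elevation.
features = sine_encode(camera_quaternion(90.0, 15.0))
```

In a setup like this, the resulting feature vector would be passed to the diffusion model together with the text embedding, so that front, side, and back views receive clearly distinct conditioning signals.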