GeoDream: Disentangling 2D and Geometric Priors for High-Fidelity and Consistent 3D Generation

Baorui Ma, Haoge Deng, Junsheng Zhou, Yu-Shen Liu, Tiejun Huang, Xinlong Wang

TL;DR

GeoDream tackles the Janus problem in text-to-3D generation by disentangling explicit 3D priors from 2D diffusion priors. It first learns native 3D priors through a cost-volume-based geometry/texture pipeline, then refines them using predictions from diverse multi-view diffusion models in a disentangled fashion, enabling robust 3D consistency and high-fidelity 1024×1024 renderings. The approach yields more realistic textured meshes, better semantic coherence (measured by Uni3D_score), and strong 3D consistency relative to baselines, with ablations confirming the value of 3D priors, viewpoint sampling, and staged optimization. The work shows that decoupling 3D and 2D priors unlocks 3D awareness in 2D diffusion while preserving generalization and creativity.

Abstract

Text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models has shown great promise but still suffers from inconsistent 3D geometric structures (Janus problems) and severe artifacts. These problems mainly stem from 2D diffusion models lacking 3D awareness during lifting. In this work, we present GeoDream, a novel method that incorporates explicit generalized 3D priors with 2D diffusion priors to enhance the capability of obtaining unambiguous, 3D-consistent geometric structures without sacrificing diversity or fidelity. Specifically, we first utilize a multi-view diffusion model to generate posed images and then construct a cost volume from the predicted images, which serves as a native 3D geometric prior, ensuring spatial consistency in 3D space. Subsequently, we propose to harness the 3D geometric priors to unlock the potential of 3D awareness in 2D diffusion priors via a disentangled design. Notably, disentangling 2D and 3D priors allows us to refine the 3D geometric priors further. We show that the refined 3D geometric priors aid the 3D-aware capability of 2D diffusion priors, which in turn provides superior guidance for the refinement of the 3D geometric priors. Our numerical and visual comparisons demonstrate that GeoDream generates more 3D-consistent textured meshes with high-resolution realistic renderings (i.e., 1024 $\times$ 1024) and adheres more closely to semantic coherence.
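
To make the cost-volume step in the abstract more concrete, the following is a minimal sketch of one common construction from multi-view stereo: project each voxel center into every posed view, sample a per-view feature, and take the variance across views. It is an assumption that this resembles GeoDream's construction; the abstract does not specify it. The function names `project_points` and `variance_cost_volume`, the pinhole camera model, and the nearest-neighbor sampling are all illustrative choices, not the authors' implementation.

```python
import numpy as np

def project_points(points, K, w2c):
    """Project (N, 3) world points to pixel coordinates with a pinhole camera.

    K is a 3x3 intrinsic matrix, w2c a 4x4 world-to-camera matrix
    (both hypothetical inputs for this sketch).
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    cam = (w2c @ homo.T).T[:, :3]                                       # camera-space xyz
    z = cam[:, 2]
    pix = (K @ cam.T).T
    uv = pix[:, :2] / np.clip(z[:, None], 1e-8, None)                   # perspective divide
    return uv, z

def variance_cost_volume(feats, Ks, w2cs, grid_points):
    """Per-voxel feature variance across views.

    feats: (V, H, W, C) per-view feature maps (or raw images).
    grid_points: (N, 3) voxel centers in world space.
    Returns an (N, C) array; low variance means the views agree at that voxel.
    Visibility and out-of-frustum handling are ignored (pixels are clamped).
    """
    V, H, W, C = feats.shape
    samples = np.zeros((V, len(grid_points), C))
    for v in range(V):
        uv, _ = project_points(grid_points, Ks[v], w2cs[v])
        cols = np.clip(np.rint(uv[:, 0]).astype(int), 0, W - 1)  # nearest pixel column
        rows = np.clip(np.rint(uv[:, 1]).astype(int), 0, H - 1)  # nearest pixel row
        samples[v] = feats[v, rows, cols]
    return samples.var(axis=0)
```

Under this sketch, a voxel lying on a surface that all views render consistently yields near-zero variance, while a voxel in empty space or on an inconsistent region (such as a duplicated face) yields higher variance, which is the sense in which a cost volume encodes spatial consistency.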

Paper Structure

This paper contains 21 sections, 9 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: GeoDream alleviates the Janus problem by incorporating explicit 3D priors with 2D diffusion priors. GeoDream generates consistent multi-view rendered images and richly detailed textured meshes. We remove the rendering background for clearer visualization.
  • Figure 2: The overview of GeoDream. (a) 3D priors training. (b) Incorporating 3D priors with 2D diffusion priors.
  • Figure 3: Qualitative comparison with baselines. Back views are highlighted with red rectangles so that multi-face artifacts are easy to spot.
  • Figure 4: Ablation study of proposed improvements for text-to-3D generation.
  • Figure 5: More visual comparisons with baselines. For each row from top to bottom, the given prompts are: (1) 3D render of a statue of an astronaut. (2) 3D stylized game little building. (3) A brightly colored mushroom growing on a log. (4) An ice-cream cone
  • ...and 7 more figures