JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

Chenhan Jiang, Yihan Zeng, Tianyang Hu, Songcun Xu, Wei Zhang, Hang Xu, Dit-Yan Yeung

TL;DR

This work models the joint image distribution by introducing an energy function that captures the coherence among denoised images from the diffusion model, and derives joint score distillation over multiple rendered views of the 3D representation, as opposed to the single view used in SDS.

Abstract

Score Distillation Sampling (SDS) by well-trained 2D diffusion models has shown great promise in text-to-3D generation. However, this paradigm distills view-agnostic 2D image distributions into the rendering distribution of 3D representation for each view independently, overlooking the coherence across views and yielding 3D inconsistency in generations. In this work, we propose Joint Score Distillation (JSD), a new paradigm that ensures coherent 3D generations. Specifically, we model the joint image distribution, which introduces an energy function to capture the coherence among denoised images from the diffusion model. We then derive the joint score distillation on multiple rendered views of the 3D representation, as opposed to a single view in SDS. In addition, we instantiate three universal view-aware models as energy functions, demonstrating compatibility with JSD. Empirically, JSD significantly mitigates the 3D inconsistency problem in SDS, while maintaining text congruence. Moreover, we introduce the Geometry Fading scheme and Classifier-Free Guidance (CFG) Switching strategy to enhance generative details. Our framework, JointDreamer, establishes a new benchmark in text-to-3D generation, achieving outstanding results with an 88.5% CLIP R-Precision and 27.7% CLIP Score. These metrics demonstrate exceptional text congruence, as well as remarkable geometric consistency and texture fidelity.
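
To make the contrast between per-view and joint distillation concrete, the following is a rough sketch of how an energy term can enter the score, not the paper's exact derivation. The symbols $V$ (number of views), $E$ (energy function), $\sigma_t$ (noise level) and the product-times-energy form of the joint density are assumptions introduced here for illustration; the SDS gradient in the first line is the standard single-view form.

```latex
% Standard single-view SDS gradient for a rendered image x = g(\theta):
\nabla_\theta \mathcal{L}_{\text{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
      \bigl(\epsilon_\phi(x_t; y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta} \right]

% ASSUMPTION: the joint density over V views is the product of the
% per-view densities, reweighted by an energy E that is low for coherent views.
\tilde{p}(x^1,\dots,x^V \mid y)
  \;\propto\; \exp\!\bigl(-E(x^1,\dots,x^V)\bigr)\,\prod_{i=1}^{V} p(x^i \mid y)

% Taking the log and differentiating with respect to one view gives
% the per-view score plus a coupling term from the energy:
\nabla_{x^i} \log \tilde{p}(x^1,\dots,x^V \mid y)
  = \nabla_{x^i} \log p(x^i \mid y) \;-\; \nabla_{x^i} E(x^1,\dots,x^V)

% Using \epsilon_\phi \approx -\sigma_t \nabla \log p_t, a joint analogue of SDS
% under this assumption adds the energy gradient to each view's residual:
\nabla_\theta \mathcal{L}_{\text{joint}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t) \sum_{i=1}^{V}
      \bigl(\epsilon_\phi(x_t^i; y, t)
            + \sigma_t \nabla_{x_t^i} E(x_t^1,\dots,x_t^V)
            - \epsilon^i\bigr)\,
      \frac{\partial x^i}{\partial \theta} \right]
```

Under this reading, setting $E \equiv 0$ recovers independent SDS on each view, which is the case the abstract identifies as the source of 3D inconsistency. The paper's actual objective may weight the energy term differently or evaluate it on denoised images rather than noisy ones, as the abstract suggests.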

Paper Structure (46 sections, 20 equations, 19 figures, 3 tables)

Figures (19)

  • Figure 1: Text-to-3D generations by JointDreamer from scratch. JointDreamer excels in generating geometrically consistent and high-fidelity 3D assets, adhering to complex textual descriptions that are challenging for previous methods.
  • Figure 2: Illustration of text-conditioned images for different viewpoints, where input texts are augmented with corresponding direction prompts for each view. (a) The original generations from the 2D diffusion model (sd2022) are view-agnostic and inconsistent across views. (b) Text prompt tuning (perpneg23) yields limited improvement in the directional structure of generated images for each view. (c) JSD injects a coherence measurement from the proposed binary classifier (refer to Section \ref{sec:view}), contributing to corrected directional structures and semantic consistency across views.
  • Figure 3: Overview of the JointDreamer framework. We introduce an energy function to model the joint distribution of multi-view denoised images from the 2D diffusion model, facilitating Joint Score Distillation (JSD) optimization for text-to-3D generation.
  • Figure 4: Illustration of the binary classification model and qualitative results with JSD. (a) The classification model $M_{\text{CLS}}$ produces the binary logit to measure the consistency between two input views $x^i$ and $x^j$. (b) JSD integrated with the classification model effectively alleviates Janus issues compared to SDS.
  • Figure 5: Comparison of text-to-3D generation. See Appendix for more results.
  • ...and 14 more figures