DiverseDream: Diverse Text-to-3D Synthesis with Augmented Text Embedding

Uy Dieu Tran, Minh Luu, Phong Ha Nguyen, Khoi Nguyen, Binh-Son Hua

TL;DR

DiverseDream tackles the limited diversity in text-to-3D synthesis by augmenting the text prompt with per-particle HiPer tokens learned from 2D reference images and a shared domain adaptor, enabling joint generation of multiple 3D models from a single prompt. The method unfolds in two stages: HiPer tokens inversion to obtain per-reference-image tokens, followed by Textual Score Distillation (TSD) that conditions each particle with augmented prompts [y; h^*_i; φ], achieving higher diversity than prior SDS/VSD approaches while maintaining quality. Empirical results show improved diversity via IV and Cosine Similarity metrics and qualitative variety, plus an extension to 3D Gaussian Splatting that reduces training time. Limitations include dependence on HiPer inversions and the Janus problem; the work also suggests applying augmented text embedding to other 3D representations and diffusion-guided pipelines for broader impact.
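As a rough illustration of the augmented prompt $[y; h^*_i; \phi]$ described above, the following NumPy sketch builds a per-particle conditioning embedding. It assumes the three parts are concatenated along the token axis, which the summary does not spell out. The shapes, the random stand-in embeddings, and the helper name `augment_prompt` are hypothetical and not taken from the paper's code.

```python
import numpy as np

def augment_prompt(y, h_i, phi):
    """Build [y; h_i; phi] by concatenating along the token axis.

    Assumption: plain concatenation; the actual method may place the
    tokens differently inside the text embedding.
    """
    return np.concatenate([y, h_i, phi], axis=0)

rng = np.random.default_rng(0)
n_particles, dim = 4, 8

# Stand-in for the embedding of the shared text prompt y (5 tokens).
y = rng.normal(size=(5, dim))
# Stand-ins for per-particle HiPer tokens h*_i from stage 1 (2 tokens each).
h = rng.normal(size=(n_particles, 2, dim))
# Stand-in for the shared embedding phi (1 token), learned during stage 2.
phi = np.zeros((1, dim))

# One optimization step: sample a particle and build its augmented prompt.
i = int(rng.integers(n_particles))
y_aug = augment_prompt(y, h[i], phi)
print(i, y_aug.shape)
```

Each particle keeps the same prompt $y$ and the same $\phi$, so only $h^*_i$ distinguishes the conditioning of one particle from another.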

Abstract

Text-to-3D synthesis has recently emerged as a new approach to sampling 3D models by adopting pretrained text-to-image models as guiding visual priors. An intriguing but underexplored problem with existing text-to-3D methods is that 3D models obtained from the sampling-by-optimization procedure tend to have mode collapses, and hence poor diversity in their results. In this paper, we provide an analysis and identify potential causes of such a limited diversity, which motivates us to devise a new method that considers the joint generation of different 3D models from the same text prompt. We propose to use augmented text prompts via textual inversion of reference images to diversify the joint generation. We show that our method leads to improved diversity in text-to-3D synthesis qualitatively and quantitatively. Project page: https://diversedream.github.io

Paper Structure

This paper contains 13 sections, 10 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: We address the intriguing low-diversity issue in text-to-3D synthesis by reconsidering the text prompt used by variational score distillation (ProlificDreamer; Wang et al., 2023). We propose to use reference images to sample augmented text prompts via textual inversion and use these augmented text prompts to condition the particles in the variational inference of text-to-3D optimization to learn more diverse 3D representations. Thanks to the diversity in the reference images (top-left inline images), we obtain diverse 3D models that inherit certain structures from their references.
  • Figure 2: We simulate SDS (first row) and VSD (second row) in KL form on a 1D toy dataset, where the ground-truth distribution $p_{\text{SD}}(x_t|y)$ is a 7-component Gaussian mixture model. Results are shown at $t=1$ (low-noise data). In the third and fourth rows, varying $p_{\text{SD}}(x_t|y')$ with a new text prompt $y'$ leads to diverse outcomes across different runs with the SDS/VSD loss, motivating our approach.
  • Figure 3: We translate the diversity of augmented text prompts to the resulting 3D models via a two-stage method. Stage 1: HiPer tokens inversion (left): for each reference image, we learn a HiPer token $h_i$ so that the prompt $[y; h_i]$ reconstructs the reference image. Stage 2: Textual score distillation (right): we run a multi-particle variational inference to optimize the 3D models from the text prompt $y$. At each iteration of the optimization, we randomly sample a particle $\theta_i$ with its rendered image $x_i$. We use the augmented text prompt $y'_i = [y; h^*_i; \phi]$, with $\phi$ as a shared embedding, to condition the optimization of $\theta_i$ (Eq. \ref{eq:tsd} and Eq. \ref{eq:textual_phi}).
  • Figure 4: Optimization progress of VSD (upper) vs. ours (lower). TSD, which has fewer learnable parameters, converges faster than VSD. Prompt: "A high-quality ice cream sundae".
  • Figure 5: Diversity comparison between state-of-the-art methods and our method.
  • ...and 4 more figures
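
The intuition behind the Figure 2 toy setting can be sketched with a few lines of NumPy. This is not the paper's simulation and implements neither the SDS nor the VSD loss. It only follows the exact score of a 7-component 1D Gaussian mixture by gradient ascent, to show that a mode-seeking update lands on a single mode and that changing the target distribution (a crude stand-in for a new prompt $y'$) changes which mode is reached. The means, standard deviation, weights, step size, and function names are all assumptions chosen for illustration.

```python
import numpy as np

def gmm_score(x, means, std, weights):
    """Return d/dx log p(x) for a 1D Gaussian mixture with a shared std."""
    x = np.asarray(x, dtype=float)[..., None]
    # Unnormalized log responsibilities (shared std, so constants cancel).
    log_comp = -0.5 * ((x - means) / std) ** 2 + np.log(weights)
    log_comp -= log_comp.max(axis=-1, keepdims=True)  # numerical stability
    resp = np.exp(log_comp)
    resp /= resp.sum(axis=-1, keepdims=True)
    # Score of a mixture: responsibility-weighted component scores.
    return (resp * (means - x) / std**2).sum(axis=-1)

def ascend(x0, means, std, weights, lr=0.01, steps=500):
    """Plain gradient ascent on log p(x): a mode-seeking update."""
    x = float(x0)
    for _ in range(steps):
        x += lr * float(gmm_score(x, means, std, weights))
    return x

means = np.linspace(-6.0, 6.0, 7)   # assumed mode locations for "prompt y"
weights = np.full(7, 1.0 / 7.0)     # assumed equal mixture weights
std = 0.3                           # assumed shared standard deviation

# Same starting point, two different targets.
x_y = ascend(0.4, means, std, weights)
x_y_prime = ascend(0.4, means + 1.0, std, weights)  # shifted target for "y'"
print(round(x_y, 3), round(x_y_prime, 3))
```

Under these assumptions, the same initialization settles near different modes once the target distribution is changed, which matches the motivation stated in the caption for conditioning each particle on a different augmented prompt.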