DiverseDream: Diverse Text-to-3D Synthesis with Augmented Text Embedding
Uy Dieu Tran, Minh Luu, Phong Ha Nguyen, Khoi Nguyen, Binh-Son Hua
TL;DR
DiverseDream tackles the limited diversity of text-to-3D synthesis by augmenting the text prompt with per-particle HiPer tokens learned from 2D reference images, together with a shared domain adaptor, so that multiple 3D models can be generated jointly from a single prompt. The method has two stages: HiPer token inversion, which obtains tokens for each reference image, followed by Textual Score Distillation (TSD), which conditions each particle on the augmented prompt [y; h^*_i; φ]. It achieves higher diversity than prior SDS/VSD approaches while maintaining quality. Empirical results show improved diversity on the IV and cosine similarity metrics as well as greater qualitative variety, and an extension to 3D Gaussian Splatting reduces training time. Limitations include dependence on HiPer inversion and the Janus problem; the work also suggests applying augmented text embeddings to other 3D representations and diffusion-guided pipelines.
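To make the augmented prompt [y; h^*_i; φ] concrete, the following is a minimal NumPy sketch of how one conditioning tensor per particle could be assembled. It is an illustration under assumptions, not the authors' implementation: the function name `augment_prompt_embeddings`, the tensor shapes, the treatment of φ as a block of shared embedding vectors, and the choice to concatenate along the token axis are all hypothetical.

```python
import numpy as np

def augment_prompt_embeddings(y, hiper_tokens, phi):
    """Build one augmented embedding [y; h_i; phi] per particle.

    Assumed shapes (hypothetical, for illustration only):
      y:            (T, D)     prompt embedding shared by all particles
      hiper_tokens: (N, K, D)  per-particle tokens h_i, one set per reference image
      phi:          (M, D)     shared block used by every particle
    Returns an array of shape (N, T + K + M, D).
    """
    n = hiper_tokens.shape[0]
    # Repeat the shared parts once per particle without copying the data.
    y_rep = np.broadcast_to(y, (n,) + y.shape)
    phi_rep = np.broadcast_to(phi, (n,) + phi.shape)
    # Concatenate along the token axis: shared prompt, particle tokens, shared block.
    return np.concatenate([y_rep, hiper_tokens, phi_rep], axis=1)

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 8))      # 4 prompt tokens, embedding dimension 8
h = rng.normal(size=(3, 2, 8))   # 3 particles, 2 HiPer tokens each
phi = rng.normal(size=(1, 8))    # 1 shared vector
aug = augment_prompt_embeddings(y, h, phi)
print(aug.shape)
```

In this sketch only the middle slice differs between particles, which reflects the idea in the summary that the per-particle tokens are what separate the jointly generated 3D models while the prompt and the shared component stay common to all of them.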
Abstract
Text-to-3D synthesis has recently emerged as a new approach to sampling 3D models by adopting pretrained text-to-image models as guiding visual priors. An intriguing but underexplored problem with existing text-to-3D methods is that 3D models obtained from the sampling-by-optimization procedure tend to suffer from mode collapse, and hence show poor diversity. In this paper, we provide an analysis and identify potential causes of this limited diversity, which motivates us to devise a new method that considers the joint generation of different 3D models from the same text prompt. We propose to use augmented text prompts, obtained via textual inversion of reference images, to diversify the joint generation. We show that our method improves diversity in text-to-3D synthesis both qualitatively and quantitatively. Project page: https://diversedream.github.io
