PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling
Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Yi-Lun Wu, Hong-Han Shuai
TL;DR
This work addresses the fidelity–diversity trade-off in long-prompt text-to-image generation. It introduces LPD-Bench, a benchmark that evaluates both fidelity and diversity for long, richly descriptive prompts, and it connects prompt reformulation to sampling entropy through a theoretical analysis. Building on that analysis, it presents PromptMoG, a training-free method that forms a Mixture-of-Gaussians over prompt embeddings to diversify conditioning while preserving semantics. Across four state-of-the-art rectified-flow models (SD3.5-Large, Flux.1-Krea-Dev, CogView4, Qwen-Image), PromptMoG consistently improves long-prompt diversity, as measured by the Vendi Score, without semantic drift, and it outperforms baselines such as prompt chunking and diverse-flow-style methods. In practice, the method enables richer creative exploration in long-prompt T2I generation, and the framework may be adapted to other modalities in future work.
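For readers unfamiliar with the diversity metric named above, the following is a minimal sketch of the Vendi Score: the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix. The cosine-similarity kernel and the `features` input (one feature vector per generated image) are assumptions made for illustration; the feature extractor and kernel used in this work are not specified in this summary.

```python
import numpy as np

def vendi_score(features: np.ndarray) -> float:
    """Vendi Score of n samples given an (n, d) feature matrix.

    Assumes a cosine-similarity kernel; other kernels are possible.
    """
    # L2-normalize rows so that every diagonal entry of K equals 1.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    k = x @ x.T
    # Eigenvalues of K / n are non-negative and sum to 1.
    eigvals = np.linalg.eigvalsh(k / len(x))
    # Drop numerically zero eigenvalues (0 * log 0 is taken as 0).
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))
```

The score can be read as an effective number of distinct samples: it is 1 when all samples are identical and n when the n samples are mutually orthogonal in feature space.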
Abstract
Recent advances in text-to-image (T2I) generation have achieved remarkable visual quality through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drift.
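The core idea of sampling prompt embeddings from a Mixture-of-Gaussians can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function name `sample_prompt_embeddings`, the isotropic covariance `sigma`, the uniform default mixture weights, and the way the component centers are obtained (for example, embeddings of reformulated versions of the same prompt) are all assumptions that the abstract does not specify.

```python
import numpy as np

def sample_prompt_embeddings(centers, sigma, num_samples, weights=None, seed=0):
    """Draw prompt embeddings from a Mixture-of-Gaussians (hypothetical sketch).

    centers: (K, T, D) array, one embedding (T tokens, D dims) per component.
    sigma:   isotropic standard deviation shared by all components (assumed).
    Returns the samples, shape (num_samples, T, D), and the component indices.
    """
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=np.float64)
    num_components = centers.shape[0]
    if weights is None:
        # Assumed default: every component is equally likely.
        weights = np.full(num_components, 1.0 / num_components)
    # Step 1: pick a mixture component for each sample.
    component = rng.choice(num_components, size=num_samples, p=weights)
    # Step 2: add Gaussian noise around the chosen component's center.
    noise = rng.normal(0.0, sigma, size=(num_samples,) + centers.shape[1:])
    return centers[component] + noise, component
```

Each sampled embedding would then replace the original prompt embedding as the conditioning input for one generated image. A small `sigma` keeps samples close to their centers, which is one plausible way to limit semantic drift; how the paper actually sets the mixture parameters is not stated here.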
