PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Yi-Lun Wu, Hong-Han Shuai

TL;DR

This work addresses the fidelity–diversity trade-off in long-prompt text-to-image generation by introducing LPD-Bench, a benchmark designed to systematically evaluate both fidelity and diversity for long, richly descriptive prompts. It develops a theoretical connection between prompt reformulation and sampling entropy and presents a training-free method, PromptMoG, which forms a Mixture-of-Gaussians over prompt embeddings to diversify conditioning while preserving semantics. Empirical results across four state-of-the-art rectified-flow models (SD3.5-Large, Flux.1-Krea-Dev, CogView4, Qwen-Image) show that PromptMoG consistently improves long-prompt diversity (as measured by the Vendi Score) without semantic drift, outperforming baselines such as prompt chunking and diverse-flow-style methods. The work offers practical impact by enabling richer creative exploration in long-prompt T2I generation and provides a framework adaptable to other modalities in future work.
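
For readers unfamiliar with the metric, the Vendi Score is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix K/n, where K has ones on its diagonal. The sketch below is a minimal pure-Python illustration and not the evaluation code used in this work. The function names and the Jacobi eigenvalue helper are illustrative choices, and a real pipeline would build K from image features (for example InceptionV3, as in Figure 2) rather than from hand-written matrices.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle that zeroes the (p, q) entry after the rotation.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def vendi_score(similarity):
    """exp(entropy of eigenvalues of K/n); ranges from 1 (identical) to n (all distinct)."""
    n = len(similarity)
    eigvals = symmetric_eigenvalues([[v / n for v in row] for row in similarity])
    entropy = -sum(lam * math.log(lam) for lam in eigvals if lam > 1e-12)
    return math.exp(entropy)


if __name__ == "__main__":
    identical = [[1.0] * 3 for _ in range(3)]
    distinct = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    print(vendi_score(identical), vendi_score(distinct))
```

A set of identical samples scores about 1 and a set of n mutually dissimilar samples scores about n, so the value can be read as an effective number of distinct samples.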

Abstract

Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drift.
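
The core idea of sampling a prompt embedding from a Mixture-of-Gaussians can be sketched as follows. This is a hedged illustration only: the abstract does not say how PromptMoG constructs its component means or covariances, so `build_mog_means`, `spread`, and `sigma` are hypothetical stand-ins (here the means are simply jittered copies of a base embedding), and the embedding is a plain list of floats rather than a real text-encoder output.

```python
import random


def build_mog_means(embedding, num_components, spread, rng):
    """Hypothetical: place each component mean by jittering the base embedding."""
    return [
        [x + spread * rng.gauss(0.0, 1.0) for x in embedding]
        for _ in range(num_components)
    ]


def sample_from_mog(means, sigma, rng):
    """Draw one embedding from an equal-weight, isotropic Mixture-of-Gaussians."""
    mean = rng.choice(means)  # pick a component uniformly at random
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


if __name__ == "__main__":
    rng = random.Random(0)
    base = [0.1 * i for i in range(8)]  # toy stand-in for a prompt embedding
    means = build_mog_means(base, num_components=4, spread=0.05, rng=rng)
    for _ in range(3):
        print(sample_from_mog(means, sigma=0.01, rng=rng))
```

Each generated image would then be conditioned on a different sampled embedding instead of the single fixed one, which is what diversifies the conditioning; keeping `spread` and `sigma` small is the assumed mechanism for staying close to the original semantics.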

Paper Structure

This paper contains 54 sections, 36 equations, 25 figures, 4 tables, 1 algorithm.

Figures (25)

  • Figure 1: Qualitative comparison of diversity generated with Flux.1-Krea-Dev. Images are generated from different random seeds using (a) long prompts and (b) short prompts.
  • Figure 2: Comparison of diversity across different prompt lengths using the Vendi Score with InceptionV3. From left to right: using the first sentence, the first three sentences, and all sentences from each long prompt.
  • Figure 3: Illustration of the toy example. (Top) 1D Mixture-of-Gaussians for varying $n$. (Bottom) Estimated entropy $H_n$ alongside the theoretical curve $h+\log n$, together with the Vendi Score.
  • Figure 4: Comparison of different benchmarks. For clarity, the average length of the three smallest datasets is ignored.
  • Figure 5: Qualitative comparison with Flux.1-Krea-Dev (top), CogView4 (middle), and Qwen-Image (bottom). Vendi Scores are shown at the top left, and failed outputs are marked at the bottom left. Prompts are truncated; the full versions appear in the section on additional comparisons.
  • ...and 20 more figures
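
The relation between entropy and the number of mixture components mentioned in the Figure 3 caption can be checked numerically. The sketch below is an assumption-laden reconstruction, not the paper's toy setup: it uses equal-weight 1D Gaussians with a shared standard deviation and means spaced far apart (`gap` is a hypothetical parameter), in which case the mixture entropy is approximately $h + \log n$, where $h = \tfrac{1}{2}\log(2\pi e \sigma^2)$ is the entropy of a single component. The entropy $H_n$ is estimated by Monte Carlo as the average of $-\log p(x)$ over samples from the mixture.

```python
import math
import random


def mog_log_density(x, means, sigma):
    """Log-density of an equal-weight 1D Mixture-of-Gaussians at x."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    density = sum(norm * math.exp(-0.5 * ((x - m) / sigma) ** 2) for m in means)
    return math.log(density / len(means))


def estimate_mog_entropy(num_components, sigma=1.0, gap=20.0,
                         num_samples=20000, seed=0):
    """Monte Carlo estimate of H_n = E[-log p(x)] for well-separated components."""
    rng = random.Random(seed)
    means = [gap * sigma * k for k in range(num_components)]
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(rng.choice(means), sigma)  # sample from the mixture
        total -= mog_log_density(x, means, sigma)
    return total / num_samples


def gaussian_entropy(sigma=1.0):
    """Differential entropy h of a single 1D Gaussian, in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


if __name__ == "__main__":
    h = gaussian_entropy()
    for n in (1, 2, 4, 8):
        print(n, estimate_mog_entropy(n), h + math.log(n))
```

When the components overlap, the estimated entropy falls below $h + \log n$, so the wide spacing here is what makes the two values agree.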

Theorems & Definitions (3)

  • proof
  • proof
  • proof