Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization

Debin Meng, Chen Jin, Zheng Gao, Yanran Li, Ioannis Patras, Georgios Tzimiropoulos

TL;DR

TPSO addresses limited diversity and mode collapse in text-to-image diffusion models with a training-free, model-agnostic dual-space approach. It optimizes learnable token embedding offsets to explore underrepresented regions while enforcing semantic alignment in the prompt embedding space to prevent drift, balancing diversity against fidelity. The method combines a semantic alignment loss with a target threshold, a diversity loss across variant prompts, and a progressive embedding scheduler that applies diversity early in sampling. Across MS-COCO and multiple diffusion backbones, TPSO yields substantial diversity gains with only modest changes to alignment and image quality, indicating that it is practical for a broad range of diffusion pipelines.
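To make the three ingredients named in the TL;DR concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the hinge form of the alignment loss, the use of mean pairwise cosine similarity as the diversity loss, the hard switch in the scheduler, and every name and default (`tau`, `lam`, `early_fraction`) are assumptions made for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(variant, original, tau=0.9):
    # Assumed hinge form: the penalty vanishes once the variant prompt
    # embedding is at least `tau`-similar to the original prompt embedding.
    return max(0.0, tau - cosine(variant, original))


def diversity_loss(variants):
    # Assumed form: mean pairwise cosine similarity between variants.
    # Minimizing it pushes the variant prompt embeddings apart.
    pairs = list(combinations(variants, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def total_loss(variants, original, tau=0.9, lam=1.0):
    # `lam` is a hypothetical weight trading diversity against alignment.
    align = sum(alignment_loss(v, original, tau) for v in variants) / len(variants)
    return align + lam * diversity_loss(variants)


def scheduled_embedding(step, num_steps, variant, original, early_fraction=0.5):
    # Assumed scheduler: condition on the variant embedding during the early
    # part of sampling, then fall back to the original embedding.
    return variant if step < early_fraction * num_steps else original
```

In an actual pipeline the vectors would be prompt embeddings produced by the text encoder from offset token embeddings, and the offsets would be updated by gradient descent on a loss of this kind; the sketch only shows how the terms could interact.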

Abstract

Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.

Paper Structure

This paper contains 18 sections, 11 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Baseline diffusion models tend to produce low-diversity or nearly repeated outputs across sampling runs, leading to redundant sampling efforts. In contrast, our approach enhances output diversity, yielding a broader range of plausible and distinct visual outcomes.
  • Figure 2: Overview of how our method explores underrepresented regions in the token embedding space under semantic constraints while avoiding invalid regions that cause semantic drift. (a) TPSO updates token embeddings away from the default fixed embeddings (orange) toward underrepresented regions. The green semantic-consistency surface conceptually represents our semantic constraint, ensuring that the resulting prompt embeddings remain meaning-preserving. (b) Interpolating between the token embeddings for 'cat' and 'lion' produces smooth semantic transitions, revealing underrepresented yet meaningful regions. However, drifting too far leads to invalid exploration, where semantic alignment breaks and generated images begin to swap or mix characteristics of 'cat' and 'lion'.
  • Figure 3: Overview of the TPSO pipeline. TPSO optimizes lightweight learnable offsets in the token embedding space to produce multiple semantically aligned prompt variants, which induce diverse conditional representations and yield different CFG guidance directions. This enables substantially richer generative diversity, all without modifying the diffusion model or introducing any additional components.
  • Figure 4: TPSO produces a wide range of visually distinct yet semantically faithful variants, demonstrating its ability to increase diversity without sacrificing image quality. In contrast, baseline diffusion models generate highly repetitive outputs across different sampling runs, revealing limited diversity and a strong tendency to remain near dominant patterns in the learned distribution.
  • Figure 5: Across three diffusion backbones (from top to bottom: SD1.5, SD2.1, and SD3.5), incorporating the diversity loss produces a broader range of plausible variations with richer structural and fine-grained differences, while maintaining visual fidelity. Each group compares generations with and without the diversity loss under the same prompt and sampling setup.
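
The interpolation described in Figure 2(b) can be illustrated with a small sketch. It assumes plain linear interpolation between two token embeddings, which the caption does not specify, and the function names and toy vectors are hypothetical stand-ins rather than real encoder outputs.

```python
def lerp(u, v, t):
    """Linearly interpolate between vectors u and v; t=0 gives u, t=1 gives v."""
    return [(1 - t) * a + t * b for a, b in zip(u, v)]


def interpolation_path(start, end, num_points):
    # Evenly spaced points from `start` to `end`, endpoints included.
    # Requires num_points >= 2.
    return [lerp(start, end, i / (num_points - 1)) for i in range(num_points)]


# Toy 2-D vectors standing in for the 'cat' and 'lion' token embeddings.
cat_embedding = [0.0, 2.0]
lion_embedding = [4.0, 0.0]
path = interpolation_path(cat_embedding, lion_embedding, 5)
```

Each point on `path` would be substituted for the original token embedding before prompt encoding. Points near either endpoint stay close to one concept, while points near the middle correspond to the region where the caption reports mixed 'cat' and 'lion' characteristics.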