Diverse Text-to-Image Generation via Contrastive Noise Optimization

Byungjun Kim, Soobin Um, Jong Chul Ye

TL;DR

This work tackles limited diversity in text-to-image diffusion by proposing Contrastive Noise Optimization (CNO), a lightweight pre-processing strategy that optimizes a batch of initial noise latents $\mathbf{z}_T$ using an InfoNCE-like loss defined in the Tweedie denoising space, promoting diverse outputs while anchoring each sample to a reference for fidelity. The method introduces a $\gamma$-regularized attraction term and uses downsampling for efficiency, with stop-gradient to reduce compute. Theoretical insights extend the InfoNCE mutual information bound to include negatives, and empirical results across SD1.5, SDXL, and SD3 demonstrate a superior quality-diversity Pareto frontier with robustness to hyperparameters. CNO achieves strong diversity improvements (MSS, Vendi Score) with minimal overhead, offering a practical, model-agnostic path to richer, text-aligned generations.

Abstract

Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images, largely enabled by text-guided inference. However, this advantage often comes with a critical drawback: limited diversity, as outputs tend to collapse into similar modes under strong text guidance. Existing approaches typically optimize intermediate latents or text conditions during inference, but these methods deliver only modest gains or remain sensitive to hyperparameter tuning. In this work, we introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective. Unlike prior techniques that adapt intermediate latents, our approach shapes the initial noise to promote diverse outputs. Specifically, we develop a contrastive loss defined in the Tweedie data space and optimize a batch of noise latents. Our contrastive optimization repels instances within the batch to maximize diversity while keeping them anchored to a reference sample to preserve fidelity. We further provide theoretical insights into the mechanism of this preprocessing to substantiate its effectiveness. Extensive experiments across multiple T2I backbones demonstrate that our approach achieves a superior quality-diversity Pareto frontier while remaining robust to hyperparameter choices.
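
As a rough illustration of what a "Tweedie data space" feature could look like, the sketch below computes the standard posterior-mean (Tweedie) estimate of the clean latent from a noisy latent and a noise prediction, then downsamples and normalizes it. The function names, the use of non-overlapping average pooling for the window size $w$, and the unit-norm step are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def tweedie_x0(z_t, eps_pred, alpha_bar_t):
    """Posterior-mean (Tweedie) estimate of the clean latent, given a noisy
    latent z_t, a noise prediction eps_pred, and the cumulative alpha."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def avg_pool(x, w):
    """Average-pool a (B, C, H, W) array with non-overlapping w x w windows.
    Assumption: this is one plausible reading of 'downsampling'."""
    b, c, h, wd = x.shape
    if h % w or wd % w:
        raise ValueError("spatial size must be divisible by the window size")
    return x.reshape(b, c, h // w, w, wd // w, w).mean(axis=(3, 5))

def pooled_features(z_t, eps_pred, alpha_bar_t, w):
    """Hypothetical feature map: Tweedie estimate -> pooling -> unit norm."""
    x0_hat = tweedie_x0(z_t, eps_pred, alpha_bar_t)
    flat = avg_pool(x0_hat, w).reshape(x0_hat.shape[0], -1)
    return flat / np.linalg.norm(flat, axis=1, keepdims=True)

# Toy usage: when eps_pred equals the true noise, the estimate recovers x0.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(2, 4, 16, 16))
eps = rng.normal(size=x0.shape)
alpha_bar = 0.3
z_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
feats = pooled_features(z_t, eps, alpha_bar, w=8)
print(feats.shape)
```

In a real pipeline the noise prediction would come from the diffusion backbone evaluated at $\mathbf{z}_T$; here it is replaced by the true noise so that the example stays self-contained.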

Paper Structure

This paper contains 26 sections, 2 theorems, 25 equations, 8 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

The InfoNCE loss in Eq. (eq:infonce) satisfies

$$\mathcal{L}_{\text{InfoNCE}} \geq \log B - I(X; Y),$$

where $B$ denotes the batch size and $I(X;Y)$ is the mutual information between random variables $X$ and $Y$.
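
A small numerical check can make the role of $B$ concrete. The sketch below assumes the proposition takes the standard form of the InfoNCE bound, $I(X;Y) \geq \log B - \mathcal{L}_{\text{InfoNCE}}$, and evaluates the loss on two toy cases: perfectly aligned pairs and queries that carry no information about their keys. The `info_nce` helper and the toy features are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def info_nce(queries, keys, tau):
    """Mean InfoNCE loss. keys[i] is the positive for queries[i];
    every other row of keys acts as a negative."""
    logits = queries @ keys.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

B = 8
keys = np.eye(B)  # B mutually orthogonal unit vectors

# Case 1: each query matches its own key exactly.
loss_aligned = info_nce(keys, keys, tau=0.1)

# Case 2: every query is the same vector, equally similar to all keys.
flat_queries = np.ones((B, B)) / np.sqrt(B)
loss_uninformative = info_nce(flat_queries, keys, tau=0.1)

# Lower bounds on mutual information implied by the assumed inequality.
bound_aligned = np.log(B) - loss_aligned
bound_uninformative = np.log(B) - loss_uninformative
print(bound_aligned, bound_uninformative, np.log(B))
```

Because the loss is non-negative, the implied lower bound can never exceed $\log B$, which is why larger batches allow the bound to certify more mutual information.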

Figures (8)

  • Figure 1: Example results from our diverse image generation approach. Three distinct prompts are used: (top) "A person skiing on a very snowy slope", (middle) "A cow sits in a truck with hay barrels in it", and (bottom) "A man sitting on a couch next to a dog". Standard DDIM (a) exhibits pronounced mode collapse, producing repetitive images and often failing to capture complex compositional details. CADS (Sadat et al., 2024) (b) improves diversity but still yields limited variation and occasional prompt misalignment. Our method (c) delivers markedly greater diversity and fidelity, generating a wide range of images that remain strongly aligned with the input text.
  • Figure 2: Conceptual overview of contrastive noise optimization. Our method enhances generation diversity by optimizing the initial latent vectors, $\mathbf{z}_T$, prior to the DDIM sampling process. We employ an InfoNCE loss that operates on a batch of noise vectors. This loss pushes the sample being optimized (the anchor, blue dot) away from all other samples in the batch, which act as negatives, to maximize separation. To preserve semantic fidelity, this repulsion is counterbalanced by an attraction force that pulls the anchor towards its original, non-optimized version (the positive pair), which acts as a fixed reference point. The attraction coefficient $\gamma$ regulates this anchoring force, stabilizing the fidelity-diversity trade-off. This pre-processing step diversifies the final image outputs without fine-tuning or altering the underlying diffusion sampler.
  • Figure 3: Pareto curves of diverse sampling methods, plotting Vendi Score against text-to-image alignment metrics. For our methods, we use $N_{\text{opt}}=5$, $\gamma=1.0$, $w=8$, and $\tau=0.1$ throughout.
  • Figure 4: Ablation on the window size $w$. The Pareto frontier of PickScore vs. Vendi Score.
  • Figure 5: Qualitative comparison with pre-existing zero-shot diverse generative methods. For the prompts "A white rabbit on the moon." (left) and "A green unicorn in a snowy forest" (right), we compare our method (d) with baseline approaches. Our method successfully generates high-fidelity images that are strongly aligned with the text prompts. In contrast, the other methods exhibit various failures.
  • ...and 3 more figures
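
The repulsion and attraction forces described in the Figure 2 caption can be sketched as a toy optimization loop. Everything below is an illustrative assumption rather than the paper's algorithm: the loss places $\gamma$ on the positive-pair term so that $\gamma = 1$ recovers plain InfoNCE, features are just normalized latents instead of Tweedie estimates from a diffusion model, gradients are taken by finite differences instead of backpropagation, and the step size is arbitrary. The values $N_{\text{opt}}=5$, $\gamma=1.0$, and $\tau=0.1$ mirror those quoted for Figure 3.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cno_loss(z, z_ref, gamma=1.0, tau=0.1):
    """Hypothetical gamma-weighted InfoNCE: each sample is attracted to its
    fixed reference (positive) and repelled from the rest of the batch."""
    f, r = normalize(z), normalize(z_ref)
    pos = np.sum(f * r, axis=1) / tau        # similarity to own reference
    sim = f @ f.T / tau                      # similarities within the batch
    np.fill_diagonal(sim, -np.inf)           # exclude self-similarity
    logits = np.concatenate([pos[:, None], sim], axis=1)
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(lse - gamma * pos))

def numerical_grad(fn, z, h=1e-5):
    """Central finite differences; a stand-in for autograd in this toy."""
    grad = np.zeros_like(z)
    for idx in np.ndindex(z.shape):
        step = np.zeros_like(z)
        step[idx] = h
        grad[idx] = (fn(z + step) - fn(z - step)) / (2 * h)
    return grad

def mean_pairwise_similarity(z):
    """Average cosine similarity over distinct pairs (lower = more diverse)."""
    f = normalize(z)
    sim = f @ f.T
    b = len(z)
    return float((sim.sum() - np.trace(sim)) / (b * (b - 1)))

# A near-collapsed batch: four noisy copies of the same vector.
rng = np.random.default_rng(0)
z_ref = rng.normal(size=(1, 16)) + 0.3 * rng.normal(size=(4, 16))
z = z_ref.copy()

loss_before = cno_loss(z, z_ref)
sim_before = mean_pairwise_similarity(z)
for _ in range(5):  # N_opt = 5 optimization steps
    z = z - 0.5 * numerical_grad(lambda v: cno_loss(v, z_ref), z)
loss_after = cno_loss(z, z_ref)
sim_after = mean_pairwise_similarity(z)
print(loss_before, loss_after, sim_before, sim_after)
```

The optimized `z` would then be handed to an unmodified sampler in place of the original noise, which is what makes the approach a pre-processing step.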

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • proof
  • proof