GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

Ye Zhu, Kaleb S. Newman, Johannes F. Lutzeyer, Adriana Romero-Soriano, Michal Drozdzal, Olga Russakovsky

TL;DR

This work introduces Geometry-Aware Spherical Sampling (GASS), which enhances diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. To do so, it decomposes the diversity measure in CLIP embedding space along two orthogonal directions.

Abstract

Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
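
The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It assumes unit-norm CLIP embeddings are already available as arrays. The abstract does not say how the orthogonal direction is identified, so `orthogonal_direction` below is a hypothetical stand-in (the top principal direction of the embeddings after their text component is removed), and `projection_spread` uses the standard deviation of the projections as an assumed measure of spread. None of these names come from the paper.

```python
import numpy as np


def normalize(v, axis=-1):
    """Scale vectors to unit length so they lie on the hypersphere."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)


def orthogonal_direction(image_embs, e_t):
    """Hypothetical stand-in for the prompt-independent axis u_ind.

    Removes the component of each image embedding along the text
    embedding e_t, then takes the top principal direction of what is
    left. This is an assumption, not the paper's procedure.
    """
    residual = image_embs - np.outer(image_embs @ e_t, e_t)
    residual = residual - residual.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    u = vt[0]
    u = u - (u @ e_t) * e_t  # re-orthogonalize against e_t for numerical safety
    return normalize(u)


def projection_spread(image_embs, e_t, u_ind):
    """Spread of the batch along each axis (assumed: std of projections)."""
    spread_dep = (image_embs @ e_t).std()    # prompt-dependent axis
    spread_ind = (image_embs @ u_ind).std()  # prompt-independent axis
    return spread_dep, spread_ind


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random unit vectors stand in for real CLIP image/text embeddings.
    image_embs = normalize(rng.normal(size=(8, 16)))
    e_t = normalize(rng.normal(size=16))
    u_ind = orthogonal_direction(image_embs, e_t)
    print(projection_spread(image_embs, e_t, u_ind))
```

Under this reading, a larger spread along `e_t` corresponds to more prompt-related variation in the batch, and a larger spread along `u_ind` to more variation in prompt-independent attributes such as backgrounds.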

Paper Structure (25 sections, 3 theorems, 19 equations, 8 figures, 5 tables, 3 algorithms)

Key Result

Proposition 4.1

Consider a batch of $B$ points $\mathcal{P}=\{\mathbf{e}_i\}_{i=1}^B \subset \mathbb{S}^{d-1}$ on the CLIP hypersphere, where $\mathbb{S}^{d-1} \subset \mathbb{R}^d$. For each $\mathbf{e}_i$ after our proposed GASS guidance defined in Eq. eq:6, the new set $\tilde{\mathcal{P}}=\{\tilde{\mathbf{e}}_i

Figures (8)

  • Figure 1: Illustration of our geometric decomposition of sample diversity and GASS enhancement method in CLIP space. We decompose the diversity of generated image batches from T2I models in the CLIP hypersphere along two orthogonal axes: text embedding $\mathbf{e}_t$ (i.e., prompt-dependent) and our identified direction $\mathbf{u}_{\text{ind}}$ (i.e., prompt-independent). Our GASS method explicitly expands the geometric spread along both axes, thus enhancing the diversity of generated images across prompt-dependent content (e.g., object viewing angles) and prompt-independent visual attributes (e.g., backgrounds).
  • Figure 2: Illustration of our proposed Geometry-Aware Spherical Sampling (GASS) method. At inference step $t$, the original T2I sampling first estimates the predicted clean image $\hat{\mathbf{x}}_{0|t}$ from the intermediate noisy sample $\mathbf{x}_t$, and then predicts the noise to remove from $\mathbf{x}_t$ to obtain $\mathbf{x}_{t-1}$. GASS alters the predicted clean image from $\hat{\mathbf{x}}_{0|t}$ to $\hat{\mathbf{x}}^*_{0|t}$ through geometric expansion (see Sec. \ref{subsec:pertubation}) and gradient-based optimization (see Sec. \ref{subsec:optimization}), thus guiding the iterative sampling process with frozen generative backbones.
  • Figure 3: Non-cherry-picked qualitative comparisons with other diversity enhancement methods on ImageNet \cite{russakovsky2015imagenet} and DrawBench \cite{saharia2022photorealistic}. Compared to other methods (i.e., PG \cite{pg2024}, CADS \cite{sadatcads}, IG \cite{kynkaanniemi2024ig}, and SPELL \cite{kirchhof2025spell}), our proposed GASS generates images with both richer semantic variation (e.g., object poses and layout) and more detailed and diverse backgrounds.
  • Figure 4: GASS controls the source of diversity by expanding the geometric spread along specified directions. Specifically, expansion along the prompt-dependent axis $\mathbf{e}_t$ diversifies images through variations in pose and layout, while expansion along the prompt-independent direction $\mathbf{u}_{\text{ind}}$ changes attributes such as background and style.
  • Figure 5: GASS still increases the diversity of generated images, even when given more complex text prompts.
  • ...and 3 more figures
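
The sampling step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration assuming a deterministic DDIM-style diffusion update with cumulative noise schedule values `alpha_bar_t` and `alpha_bar_prev`. The update rule actually used in the paper, and its flow-based variant, are not given in this excerpt. The `edit_x0` callback is a hypothetical placeholder for the geometric expansion and gradient-based optimization that turn $\hat{\mathbf{x}}_{0|t}$ into $\hat{\mathbf{x}}^*_{0|t}$, and all function names are illustrative.

```python
import numpy as np


def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Predicted clean image x0_hat from noisy sample x_t and predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


def guided_ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, edit_x0=None):
    """One deterministic DDIM-style step with an optionally edited x0 prediction.

    edit_x0 is a hypothetical hook standing in for the GASS modification of
    the predicted clean image; with edit_x0=None this is a plain DDIM step.
    """
    x0_hat = predict_x0(x_t, eps_pred, alpha_bar_t)
    x0_star = x0_hat if edit_x0 is None else edit_x0(x0_hat)
    # Recompute the noise so that (x0_star, eps_star) still explains x_t.
    eps_star = (x_t - np.sqrt(alpha_bar_t) * x0_star) / np.sqrt(1.0 - alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_star + np.sqrt(1.0 - alpha_bar_prev) * eps_star


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=(4, 4))
    eps = rng.normal(size=(4, 4))
    a_t, a_prev = 0.5, 0.8
    x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
    # A constant shift stands in for the real edit of the clean prediction.
    x_prev = guided_ddim_step(x_t, eps, a_t, a_prev, edit_x0=lambda x: x + 0.1)
    print(x_prev.shape)
```

Because only the clean-image prediction is changed, the generative backbone that produces `eps_pred` stays frozen, which matches the caption's description of guiding sampling without retraining.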

Theorems & Definitions (4)

  • Proposition 4.1: Expected Geometric Volume Guarantee
  • proof
  • Lemma 1.1: Projection Commutativity
  • Theorem 1.2: Determinant Increase