GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

Ye Zhu; Kaleb S. Newman; Johannes F. Lutzeyer; Adriana Romero-Soriano; Michal Drozdzal; Olga Russakovsky

TL;DR

This work introduces Geometry-Aware Spherical Sampling (GASS), which enhances text-to-image diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. To do so, it decomposes the diversity measure in CLIP embedding space along two orthogonal directions.

Abstract

Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance T2I diversity through a geometric lens. Unlike most existing methods, which rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
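The two-axis decomposition described in the abstract can be illustrated with a small NumPy sketch. It is only a toy illustration under stated assumptions: random unit vectors stand in for CLIP embeddings, the helper `orthogonal_direction` is a hypothetical way of picking a direction orthogonal to the text embedding (the paper's actual procedure for identifying $\mathbf{u}_{\text{ind}}$ is not given here), and projection variance is used as a simple proxy for spread, not as the paper's spread score.

```python
import numpy as np

def normalize(v, axis=-1):
    """Scale vectors to unit length so they lie on the hypersphere."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def orthogonal_direction(image_embs, e_t):
    """Hypothetical choice of u_ind (an assumption, not the paper's method):
    take the batch-mean image embedding and remove its component along e_t
    with one Gram-Schmidt step."""
    mean = image_embs.mean(axis=0)
    residual = mean - (mean @ e_t) * e_t
    return normalize(residual)

def projection_spread(image_embs, e_t, u_ind):
    """Variance of the projections onto each axis, used here as an
    illustrative proxy for geometric spread."""
    proj_dep = image_embs @ e_t    # prompt-dependent coordinate
    proj_ind = image_embs @ u_ind  # prompt-independent coordinate
    return proj_dep.var(), proj_ind.var()

# Toy data: random unit vectors stand in for CLIP embeddings.
rng = np.random.default_rng(0)
d, B = 16, 8
e_t = normalize(rng.normal(size=d))
image_embs = normalize(rng.normal(size=(B, d)))

u_ind = orthogonal_direction(image_embs, e_t)
spread_dep, spread_ind = projection_spread(image_embs, e_t, u_ind)
print("spread along e_t:", spread_dep)
print("spread along u_ind:", spread_ind)
```

Because `u_ind` is built to be orthogonal to `e_t`, the two projections measure variation along independent axes, which is what lets each source of diversity be tracked separately.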

Paper Structure (25 sections, 3 theorems, 19 equations, 8 figures, 5 tables, 3 algorithms)


Introduction
Related Work
Diversity Evaluation and Measurement in T2I
Methods for Enhanced T2I Diversity
Latent Space Analysis for Generative Models
Spherically Disentangled Diversity Measure
Motivation and Problem Formulation
Spherical Disentanglement and Residual Analysis
Spherical Spread Score for Diversity Measure
GASS for Improved T2I Diversity
Latent Dynamic Spherical Guidance
SPP Gradient Optimization for T2I Generation
Experiments
Experimental Setup
Main Results and Analysis
...and 10 more sections

Key Result

Proposition 4.1

Consider a batch of $B$ points $\mathcal{P}=\{\mathbf{e}_i\}_{i=1}^B \subset \mathbb{S}^{d-1}$ on the CLIP hypersphere, where $\mathbb{S}^{d-1} \subset \mathbb{R}^d$. For each $\mathbf{e}_i$ after our proposed GASS guidance defined in Eq. (6), the new set $\tilde{\mathcal{P}}=\{\tilde{\mathbf{e}}_i

Figures (8)

Figure 1: Illustration of our geometric decomposition of sample diversity and GASS enhancement method in CLIP space. We decompose the diversity of generated image batches from T2I models in the CLIP hypersphere along two orthogonal axes: text embedding $\mathbf{e}_t$ (i.e., prompt-dependent) and our identified direction $\mathbf{u}_{\text{ind}}$ (i.e., prompt-independent). Our GASS method explicitly expands the geometric spread along both axes, thus enhancing the diversity of generated images across prompt-dependent content (e.g., object viewing angles) and prompt-independent visual attributes (e.g., backgrounds).
Figure 2: Illustration of our proposed Geometry-Aware Spherical Sampling (GASS) method. At the generation inference step $t$, original T2I sampling first estimates the predicted clean image $\hat{\mathbf{x}}_{0|t}$ based on the intermediate noisy sample $\mathbf{x}_t$, and then predicts the noise to remove from $\mathbf{x}_t$ to obtain $\mathbf{x}_{t-1}$. Our GASS alters the predicted clean image from $\hat{\mathbf{x}}_{0|t}$ to $\hat{\mathbf{x}}^*_{0|t}$ through the geometric expansion (see Sec. \ref{subsec:pertubation}) and the gradient-based optimization (see Sec. \ref{subsec:optimization}), thus guiding the iterative sampling process with frozen generative backbones.
Figure 3: Non-cherry-picked qualitative comparisons with other diversity enhancement methods on ImageNet [russakovsky2015imagenet] and Drawbench [saharia2022photorealistic]. Compared to other methods (i.e., PG [pg2024], CADS [sadatcads], IG [kynkaanniemi2024ig], and SPELL [kirchhof2025spell]), our proposed GASS generates images with both richer semantic variation (e.g., object poses and layout) and more detailed and diverse backgrounds.
Figure 4: GASS controls the source of diversity by expanding the geometric spread along specified directions. Specifically, GASS on the prompt-dependent axis $\mathbf{e}_t$ diversifies images through variations in pose and layout, while expansion along the prompt-independent direction $\mathbf{u}_{\text{ind}}$ changes attributes such as background and style.
Figure 5: GASS still increases the diversity of generated images, even when provided with more complex text prompts.
...and 3 more figures
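
The sampling step described in the Figure 2 caption (estimate the clean image, alter it, then continue sampling) can be sketched for one common case. This is a minimal sketch that assumes a deterministic DDIM-style noise-prediction parameterization. The paper also covers flow-based backbones, and its actual update rule is not given on this page. The callback `edit_x0` is a hypothetical placeholder for whatever guidance maps $\hat{\mathbf{x}}_{0|t}$ to $\hat{\mathbf{x}}^*_{0|t}$, and `alpha_bar_t` and `alpha_bar_prev` are assumed cumulative noise-schedule values.

```python
import numpy as np

def predict_x0(x_t, eps, alpha_bar_t):
    """Clean-image estimate implied by the predicted noise eps."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def guided_ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, edit_x0):
    """One deterministic DDIM-style step with an edited clean estimate.

    edit_x0 is a hypothetical stand-in for the guidance that alters the
    predicted clean image; the backbone that produced eps stays frozen.
    """
    x0_hat = predict_x0(x_t, eps, alpha_bar_t)
    x0_star = edit_x0(x0_hat)
    # Noise that is consistent with x_t and the edited clean estimate.
    eps_star = (x_t - np.sqrt(alpha_bar_t) * x0_star) / np.sqrt(1.0 - alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_star + np.sqrt(1.0 - alpha_bar_prev) * eps_star

# Toy usage: random arrays stand in for a latent and a model's noise prediction.
rng = np.random.default_rng(0)
x_t = rng.normal(size=(4, 8))
eps = rng.normal(size=(4, 8))
alpha_bar_t, alpha_bar_prev = 0.5, 0.7

# With an identity edit the step reduces to plain DDIM sampling.
x_prev_plain = guided_ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, lambda x0: x0)
# A toy edit (scaling) shows how changing the clean estimate changes the step.
x_prev_edited = guided_ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, lambda x0: 1.1 * x0)
```

Editing the clean estimate instead of the noisy latent leaves the backbone untouched, which matches the caption's point that guidance works with frozen generative models.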

Theorems & Definitions (4)

Proposition 4.1: Expected Geometric Volume Guarantee
proof
Lemma 1.1: Projection Commutativity
Theorem 1.2: Determinant Increase
