Minority-Focused Text-to-Image Generation via Prompt Optimization

Soobin Um, Jong Chul Ye

TL;DR

This work tackles the challenge of generating minority samples in text-to-image diffusion by addressing the high-density bias of common samplers. It introduces MinorityPrompt, a token-based online prompt optimization method that appends a learnable token to user prompts and updates it during inference to encourage low-density features while preserving semantics, with a theoretical link to log-likelihood via a carefully crafted objective ${\cal J}_{\cal C}$. The authors demonstrate state-of-the-art performance in minority generation across multiple backbones (including SDv1.5, SDv2.0, and SDXL-LT), show robustness to various solvers, and provide extensive ablations and human studies. Beyond minority generation, they illustrate the framework’s versatility for promoting diversity and potential applicability to other inference-time optimization tasks, releasing code to facilitate adoption.

Abstract

We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. Minority instances, in the context of T2I generation, can be defined as ones living on low-density regions of text-conditional data distributions. They are valuable for various applications of modern T2I generators, such as data augmentation and creative AI. Unfortunately, existing pretrained T2I diffusion models primarily focus on high-density regions, largely due to the influence of guided samplers (like CFG) that are essential for high-quality generation. To address this, we present a novel framework to counter the high-density-focus of T2I diffusion models. Specifically, we first develop an online prompt optimization framework that encourages emergence of desired properties during inference while preserving semantic contents of user-provided prompts. We subsequently tailor this generic prompt optimizer into a specialized solver that promotes generation of minority features by incorporating a carefully-crafted likelihood objective. Extensive experiments conducted across various types of T2I models demonstrate that our approach significantly enhances the capability to produce high-quality minority instances compared to existing samplers. Code is available at https://github.com/soobin-um/MinorityPrompt.
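As background for the abstract's remark on guided samplers, here is a minimal sketch of how classifier-free guidance (CFG) combines two noise predictions. It assumes the common convention in which a weight of `w = 1` recovers the purely conditional prediction. The function name `cfg_combine` and the toy vectors are illustrative and are not taken from the paper or its code.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float], w: float) -> List[float]:
    """Classifier-free guidance: start from the unconditional noise prediction
    and move toward (and, for w > 1, past) the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (illustrative values, not model outputs).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

for w in (1.0, 2.0, 7.5):
    print(w, cfg_combine(eps_uncond, eps_cond, w))
```

Larger weights extrapolate further along the conditional direction, which is the mechanism the abstract associates with the focus on high-density regions.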

Paper Structure

This paper contains 24 sections, 1 theorem, 23 equations, 11 figures, 8 tables, 2 algorithms.

Key Result

Proposition 1

The objective function in Eq. (eq:popt_ours) is equivalent (up to a constant factor) to the negative ELBO w.r.t. $\log p_{\boldsymbol{\theta}} ( \hat{\boldsymbol{z}}_0 ( \boldsymbol{z}_t, {\cal C}_{\boldsymbol{v}} ) \mid {\cal C} )$ when integrated over timesteps with $\bar{w}_s \coloneqq \alpha_s \ldots$ where $\boldsymbol{z}_{s|t,0} \coloneqq \sqrt{\alpha_s} \hat{\boldsymbol{z}}_0( \boldsymbol{z}_t, \ldots$

Figures (11)

  • Figure 1: Example results from our minority generation approach using SDXL-Lightning. Our framework is designed to produce unique minority samples w.r.t. user-provided prompts, which are rarely generated by standard samplers like DDIM [song2020denoising]. Because it encourages low-likelihood samples, our sampler often counteracts demographic biases in text-to-image models [friedrich2023fair]. See, for instance, the samples in the last row, where our sampler mitigates prevalent age and racial biases (e.g., associating "man" with "young" and "woman" with "white") by modifying the demographic traits of the subjects.
  • Figure 2: Overview of MinorityPrompt. Unlike existing online prompt tuning approaches that adjust the entire text-embedding (e.g., the output of the text-encoder) during inference, our framework focuses on optimizing a dedicated token-embedding to better preserve the semantics within the prompt. Specifically, given a user-prompt (e.g., "A portrait of a dog"), we integrate a placeholder string (e.g., ${\cal S}$ in the figure) into the prompt, marking the position of the learnable token embedding $\boldsymbol{v}$. With the text-embedding ${\cal C}_{\boldsymbol{v}}$ that incorporates the contents of $\boldsymbol{v}$, we update ${\boldsymbol{v}}$ on-the-fly during the inference process to maximize the reconstruction loss of the denoised version of ${\boldsymbol{z}}_t$ (i.e., $\hat{\boldsymbol{z}}_0^1$ in the figure). The optimized token ${\boldsymbol v}^*$ is subsequently used to progress the inference at the corresponding timestep; see Sec. (sec:method) for details.
  • Figure 3: Improved semantic controllability by MinorityPrompt. The samples in the first column were generated by DDIM using the two base prompts (e.g., "A chef in a white coat leans on a table" for the second row). The second and third columns show samples generated by our framework, where we selected the corresponding word embeddings as the starting points of the prompt optimizations. In the last column, we also present, for comparison, DDIM samples produced using prompts with the corresponding words attached. All samples were obtained using SDXL-Lightning [lin2024sdxl].
  • Figure 4: Sample comparison on SDXL-Lightning. Generated samples from three different approaches: (i) DDIM [song2020denoising]; (ii) SGMS [um2024self]; (iii) MinorityPrompt (ours). Six distinct prompts were used for this comparison, and random seeds were shared across all three methods.
  • Figure 5: Trade-off analysis. The DDIM curves were calculated using a range of CFG weights: $w \in \{1.0, 2.0, \ldots, 5.0, 7.5, 9.0, 12.5\}$. For the SGMS baseline [um2024self], we fixed the CFG weight at $w=7.5$ and swept the learning rate (i.e., $\eta_t$ in Eq. (eq:sgms_obj)) over $[ 2 \times 10^{-3}, 2 \times 10^{-2} ]$. Similarly, for MinorityPrompt, we used the same CFG weight of $w=7.5$ while varying the learning rate (used with AdamGrad in Alg. (alg:popt)) over $[5 \times 10^{-4}, 4 \times 10^{-3}]$. We highlight that our trade-off is significantly more favorable than those of the baselines, which suffer from substantial degradation when attempting to generate low-likelihood samples. The curves were obtained with SDv1.5.
  • ...and 6 more figures
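
The per-timestep procedure described in the Figure 2 caption can be sketched with a toy model. Everything below is an assumption made for illustration: the mean-pooling `encode`, the linear `predict_noise`, the finite-difference `grad` (a stand-in for autograd), the step sizes, and the choice to re-noise at the same noise level are hypothetical and do not reproduce the paper's objective or the released code. The sketch only shows the control flow: the prompt tokens stay frozen, a single appended token `v` is updated by gradient ascent on a reconstruction loss at each step, and the optimized token is then used to advance the sampler.

```python
import math
import random
from typing import Callable, List, Tuple

Vec = List[float]


def encode(tokens: List[Vec]) -> Vec:
    """Toy text encoder (assumption): mean of the token embeddings."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def predict_noise(z: Vec, cond: Vec) -> Vec:
    """Toy noise predictor (assumption): deviation of z from the conditioning."""
    return [zi - ci for zi, ci in zip(z, cond)]


def denoise(z: Vec, cond: Vec, alpha: float) -> Vec:
    """Clean-latent estimate from a noise prediction at noise level alpha."""
    eps = predict_noise(z, cond)
    return [(zi - math.sqrt(1 - alpha) * ei) / math.sqrt(alpha) for zi, ei in zip(z, eps)]


def objective(v: Vec, z_t: Vec, prompt: List[Vec], alpha: float, noise: Vec) -> float:
    """Denoise z_t with C_v, re-noise the estimate, denoise it again with the
    original prompt embedding C, and return the squared reconstruction error."""
    z0 = denoise(z_t, encode(prompt + [v]), alpha)
    z_s = [math.sqrt(alpha) * a + math.sqrt(1 - alpha) * n for a, n in zip(z0, noise)]
    z0_rec = denoise(z_s, encode(prompt), alpha)
    return sum((a - b) ** 2 for a, b in zip(z0, z0_rec))


def grad(f: Callable[[Vec], float], v: Vec, h: float = 1e-5) -> Vec:
    """Central finite-difference gradient of f at v."""
    g = []
    for i in range(len(v)):
        up = v[:i] + [v[i] + h] + v[i + 1:]
        dn = v[:i] + [v[i] - h] + v[i + 1:]
        g.append((f(up) - f(dn)) / (2 * h))
    return g


def optimize_token(v: Vec, f: Callable[[Vec], float], lr: float = 0.5, steps: int = 3) -> Vec:
    """Gradient ascent on the learnable token only; the prompt is never touched."""
    for _ in range(steps):
        v = [vi + lr * gi for vi, gi in zip(v, grad(f, v))]
    return v


def sample(prompt: List[Vec], v_init: Vec, alphas: List[float], seed: int = 0) -> Tuple[Vec, List[float]]:
    rng = random.Random(seed)
    dim = len(v_init)
    z = [rng.gauss(0, 1) for _ in range(dim)]
    gains = []
    for a_t, a_prev in zip(alphas[:-1], alphas[1:]):
        noise = [rng.gauss(0, 1) for _ in range(dim)]
        f = lambda v: objective(v, z, prompt, a_t, noise)
        v_star = optimize_token(list(v_init), f)   # restart from v_init at each step
        gains.append(f(v_star) - f(v_init))        # how much the objective increased
        cond = encode(prompt + [v_star])           # C_{v*} drives this sampling step
        eps, z0 = predict_noise(z, cond), denoise(z, cond, a_t)
        z = [math.sqrt(a_prev) * a + math.sqrt(1 - a_prev) * e for a, e in zip(z0, eps)]
    return z, gains


prompt = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]      # frozen "user prompt" tokens
z_final, gains = sample(prompt, [0.0, 0.0, 0.0], alphas=[0.1, 0.4, 0.7, 0.95])
print(z_final, gains)
```

Restricting the update to one appended token, rather than the whole text embedding, is the design choice the caption credits with better preserving the prompt's semantics; in this toy that simply means `prompt` is read but never modified.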

Theorems & Definitions (2)

  • Proposition 1
  • proof