FairQueue: Rethinking Prompt Learning for Fair Text-to-Image Generation

Christopher T. H. Teo, Milad Abdollahzadeh, Xinda Ma, Ngai-man Cheung

TL;DR

This work reveals abnormalities in the early denoising steps of the T2I model that perpetuate improper global structure and degrade the generated samples, and proposes two ideas, Prompt Queuing and Attention Amplification, to address the quality issue.

Abstract

Recently, prompt learning has emerged as the state-of-the-art (SOTA) for fair text-to-image (T2I) generation. Specifically, this approach leverages readily available reference images to learn inclusive prompts for each target Sensitive Attribute (tSA), allowing for fair image generation. In this work, we first reveal that this prompt learning-based approach results in degraded sample quality. Our analysis shows that the approach's training objective, which aims to align the embedding differences of learned prompts and reference images, could be sub-optimal, resulting in distortion of the learned prompts and degraded generated images. To further substantiate this claim, as our major contribution, we take a deep dive into the denoising subnetwork of the T2I model to track down the effect of these learned prompts by analyzing the cross-attention maps. In our analysis, we propose a novel prompt switching analysis: I2H and H2I. Furthermore, we propose a new quantitative characterization of cross-attention maps. Our analysis reveals abnormalities in the early denoising steps, which perpetuate improper global structure and degrade the generated samples. Building on insights from our analysis, we propose two ideas to address the quality issue: (i) Prompt Queuing and (ii) Attention Amplification. Extensive experimental results on a wide range of tSAs show that our proposed method outperforms the SOTA approach in image generation quality while achieving competitive fairness. More resources at the FairQueue project site: https://sutd-visual-computing-group.github.io/FairQueue
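To make the "embedding differences" idea concrete, here is a minimal sketch of a directional alignment score: the cosine similarity between the difference of two prompt embeddings and the difference of the mean reference-image embeddings. It is not the SOTA method's actual training loss. The function names and the 3-dimensional toy vectors are hypothetical stand-ins for real text and image embeddings.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def difference(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("direction is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def direction_alignment(prompt_a, prompt_b, images_a, images_b):
    """Compare the prompt direction with the reference-image direction.

    Assumption: each direction is a plain difference of embeddings, and
    the image side uses the mean embedding of each reference set.
    """
    prompt_direction = difference(prompt_a, prompt_b)
    image_direction = difference(mean_vector(images_a), mean_vector(images_b))
    return cosine_similarity(prompt_direction, image_direction)


# Toy (hypothetical) embeddings; real ones would come from an encoder.
images_with_tsa = [[1.0, 0.0, 0.2], [0.8, 0.0, 0.4]]
images_without_tsa = [[0.0, 0.0, 0.2], [0.0, 0.0, 0.4]]

# Prompt pair whose difference points along the image direction.
aligned_score = direction_alignment(
    [1.0, 0.5, 0.0], [0.0, 0.5, 0.0], images_with_tsa, images_without_tsa
)

# Prompt pair whose difference also moves along an unrelated axis.
distorted_score = direction_alignment(
    [1.0, 1.5, 0.0], [0.0, 0.5, 0.0], images_with_tsa, images_without_tsa
)
```

In this toy setup the second prompt pair scores lower because its direction mixes in an unrelated component. The abstract's concern is the reverse case: the reference-image direction itself may carry unrelated variation, so a high score does not guarantee an undistorted prompt.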

Paper Structure

This paper contains 26 sections, 4 equations, 43 figures, 8 tables.

Figures (43)

  • Figure 1: Our work re-visits SOTA fair T2I generation, ITI-Gen. We question ITI-Gen's central idea of prompt learning via alignment between the directions of prompt embeddings and reference image embeddings. (a) We observe degradation in images generated through ITI-Gen's learned prompts. We note that the direction of reference image embeddings could include unrelated concepts beyond tSA differences (e.g., variations in accessories), resulting in the learning of distorted prompts using ITI-Gen. Furthermore, we observe misalignment between the direction of credible hard prompts and that of reference images/learned prompts. (b) As our main contribution, and to further understand how these distorted prompts affect the image generation process, we take a deep dive into the denoising network and analyze the cross-attention maps, revealing their abnormalities, e.g., higher activity for maps associated with non-tSA tokens ("of", "a"). We examine the degraded global structures resulting from these distorted prompts in the early denoising steps. Moreover, we propose I2H and H2I (Eq. \ref{eq:I2H}) analysis to understand the impact of these degraded global structures and abnormalities in later denoising steps. In addition, we propose metrics (Eq. \ref{eq:MainPapercentralMoment}) on cross-attention maps to quantify these abnormalities. (c) Building on insights from our analysis, we propose a solution to address distorted prompts while maintaining competitive fairness. Our solution FairQueue includes two ideas: prompt queuing and attention amplification. $E_T$ and $E_I$ are the CLIP text and image encoders, resp. [radford2021clip]. $\bm{T}$, $\bm{F}$, $\bm{P}$ are the base prompt, hard prompt with minimal linguistic ambiguity, and ITI-Gen prompt, resp.
  • Figure 2: T2I generation performance of HP, ITI-Gen [zhang2023itigen], and our proposed FairQueue for target Sensitive Attributes (tSAs) with minimal linguistic ambiguities. Samples generated by HP demonstrate outstanding performance with good fairness (FD), high quality (FID and FD), and good semantic preservation (DS). Meanwhile, ITI-Gen moderately degrades sample quality, impacting fairness and semantic preservation. FairQueue demonstrates comparable performance to HP, even surpassing HP in both quality and semantic preservation in many cases. Note that HP only performs well for unambiguous tSAs and cannot be used for general fair T2I generation, as it cannot be defined well for ambiguous tSAs (see Supp for detailed discussion).
  • Figure 3: Comparison of cross-attention maps during the denoising process with HP (left) and ITI-Gen (right). Here, we use tSA=Smiling and plot the denoising process for one sample generation. Each denoising process consists of $l=50$ steps initiated with the same noisy input. Each cell depicts the attention map for the respective token (column) at the respective step (row) overlaid on the input. We highlight 3 key observations: ① ITI-Gen tokens $\bm{S}_i$ show abnormal activity compared to the corresponding tSA-related tokens in HP, either attending to unrelated regions (backgrounds) or having scattered attention. ② Non-tSA tokens like "of" and "a" are abnormally more active in the presence of ITI-Gen tokens. ③ Compared to the HP counterpart, the issues created by ITI-Gen tokens (① & ②) degrade the global structure in the early denoising steps (e.g., Step 15): for example, a human face in HP vs. some unrelated structure in ITI-Gen. The same behavior is observed for some other samples and tSAs (see Supp for more samples, and other tSAs with more denoising steps).
  • Figure 4: Analyzing the accumulated cross-attention maps for the denoising process in our proposed prompt switching analysis, I2H and H2I. Here, we use two tSAs: Smiling and High Cheekbones. For each tSA, we show the accumulated cross-attention maps for H2I and I2H, with some quantitative results. In the H2I (I2H) experiment, the first row shows the accumulated cross-attention maps during the early denoising steps with HP (ITI-Gen) as the input prompt, and the second row shows the maps during later steps, after switching to ITI-Gen (HP). Observation 1: Learned tokens in ITI-Gen affect early denoising steps, degrading global structure synthesis; such degraded global structure disrupts the final output. This is observed in I2H. Observation 2: Learned tokens in ITI-Gen work decently in the later stage of the denoising process if the global structure is synthesized properly. This is observed in H2I. As we show in Supp, similar observations can be made for other samples and other tSAs. Bottom: Histograms of our proposed metrics on cross-attention maps demonstrate the abnormalities in many samples.
  • Figure 5: Ablation Study: Comparing FairQueue performance for tSA Smiling when varying (i) the attention amplification factor $c$ or (ii) the Prompt Queuing transition point from $\bm{T} \rightarrow \bm{P}$.
  • ...and 38 more figures
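
The captions for Figures 1, 4 and 5 suggest a two-stage schedule: the base prompt $\bm{T}$ drives the early denoising steps, the learned prompt $\bm{P}$ takes over at a transition point, and attention for tSA tokens is scaled by a factor $c$. The sketch below illustrates that schedule under stated assumptions and is not the authors' implementation. The function names, the transition step of 10, and $c = 2.0$ are hypothetical. The choice to amplify only while $\bm{P}$ is active, by plain multiplication without renormalization, is also an assumption.

```python
def queued_prompt(step, transition_step, base_prompt, learned_prompt):
    """Prompt Queuing: base prompt before the transition, learned prompt after."""
    return base_prompt if step < transition_step else learned_prompt


def amplify_attention(attention, tsa_token_indices, c):
    """Scale the attention weights of tSA tokens by c; leave others unchanged.

    Assumption: plain multiplication with no renormalization.
    """
    return [
        weight * c if index in tsa_token_indices else weight
        for index, weight in enumerate(attention)
    ]


def build_schedule(num_steps, transition_step, c):
    """Return (step, prompt label, amplification factor) for every step.

    Assumption: amplification is active only while the learned prompt is used.
    """
    schedule = []
    for step in range(num_steps):
        prompt = queued_prompt(step, transition_step, "T", "P")
        factor = c if prompt == "P" else 1.0
        schedule.append((step, prompt, factor))
    return schedule


# 50 steps matches the l = 50 in the Figure 3 caption; the transition
# step (10) and c (2.0) are hypothetical values chosen for illustration.
schedule = build_schedule(num_steps=50, transition_step=10, c=2.0)

# Toy per-token attention weights; token 1 plays the role of a tSA token.
amplified = amplify_attention([0.1, 0.2, 0.3], {1}, 2.0)
```

In a real pipeline the prompt label would select a text embedding for the denoiser at each step, and the scaling would be applied inside the cross-attention layers. Both are outside the scope of this sketch.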