CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis

Aravindan Sundaram, Ujjayan Pal, Abhimanyu Chauhan, Aishwarya Agarwal, Srikrishna Karanam

TL;DR

This work introduces CoCoNO, an algorithm that optimizes the initial latent by leveraging the complementary information in self-attention and cross-attention maps. It proposes two loss functions: the attention contrast loss, which minimizes undesirable overlap between subjects, and the attention complete loss, which maximizes the activation within each subject's self-attention segment so that each subject is fully and distinctly represented.

Abstract

Despite recent advances in text-to-image diffusion models, generating semantically accurate images remains a persistent challenge. While existing initial latent optimization methods have demonstrated impressive performance, we identify two key limitations: (a) attention neglect, where the synthesized image omits certain subjects from the input prompt because they do not have a designated segment in the self-attention map despite having a high-response cross-attention map, and (b) attention interference, where the generated image mixes up properties of multiple subjects because of a conflicting overlap between the cross- and self-attention maps of different subjects. To address these limitations, we introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps. Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject's cross-attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented. Our approach operates within a noise optimization framework, avoiding the need to retrain base models. Through extensive experiments on multiple benchmarks, we demonstrate that CoCoNO significantly improves text-image alignment and outperforms the current state of the art.
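
The two losses described above can be illustrated with a toy sketch. This is not the paper's formulation, which the abstract does not give: the function names, the use of binary segment masks, and the specific overlap and coverage terms are all assumptions made for illustration. Each attention map is flattened to a list over spatial positions, and each subject is assumed to already have one assigned self-attention segment.

```python
from typing import List

Map = List[float]  # a flattened map over spatial positions (illustrative type)


def contrast_loss(cross: List[Map], segments: List[Map]) -> float:
    """Hypothetical contrast term: cross-attention mass of subject i that
    falls inside the self-attention segment assigned to another subject j."""
    total = 0.0
    for i, c in enumerate(cross):
        for j, s in enumerate(segments):
            if i != j:
                total += sum(ci * sj for ci, sj in zip(c, s))
    return total


def complete_loss(cross: List[Map], segments: List[Map]) -> float:
    """Hypothetical complete term: one minus the mean cross-attention
    activation of each subject inside its own segment."""
    total = 0.0
    for c, s in zip(cross, segments):
        area = sum(s)
        if area > 0:
            inside = sum(ci * si for ci, si in zip(c, s))
            total += 1.0 - inside / area
        else:
            # A subject with no segment at all is treated as fully neglected.
            total += 1.0
    return total


# Two subjects on a 4-position map; subject 0 owns positions 0-1, subject 1 owns 2-3.
segments = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

# Well-separated cross-attention: each subject responds only inside its own segment.
clean = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

# Subject 0 leaks into subject 1's segment, and subject 1 responds only weakly.
mixed = [[0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 0.2, 0.2]]

clean_total = contrast_loss(clean, segments) + complete_loss(clean, segments)
mixed_total = contrast_loss(mixed, segments) + complete_loss(mixed, segments)
print(clean_total, mixed_total)
```

In a noise optimization setting, a combined loss of this kind would be differentiated with respect to the initial latent and used to update it before sampling, leaving the base model's weights untouched. The sketch only evaluates the loss on fixed maps and does not include that update step.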

Paper Structure

This paper contains 15 sections, 7 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Images demonstrating subject mixing and neglect.
  • Figure 2: Our proposed CoCoNO alleviates subject neglect and mixing by ensuring one high-response self-attention segment for each subject (e.g., wolf and bear in the second row), while minimizing interference between the cross-attention map of one subject (e.g., turtle in the first row) and the self-attention segments of other subjects (dolphin here).
  • Figure 3: Intermediate one-step denoised attention maps.
  • Figure 4: A visual illustration of the proposed method.
  • Figure 5: Qualitative comparisons of CoCoNO with recent state-of-the-art methods.
  • ...and 6 more figures