Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models

Zhipeng Bao, Yijun Li, Krishna Kumar Singh, Yu-Xiong Wang, Martial Hebert

TL;DR

The paper tackles compositional misalignment in diffusion-based text-to-image generation, caused by low attention activation and overlapping cross-attention masks when rendering multiple objects. It introduces Separate-and-Enhance (SepEn), a lightweight finetuning approach that applies two losses to cross-attention: Separate loss decouples object masks to reduce overlap, and Enhance loss boosts activation for each object; finetuning is confined to the cross-attention key mappings to ensure scalability. Across single-prompt and large-scale multi-concept experiments, SepEn improves text-image alignment and realism while preserving single-object quality, and demonstrates strong generalization to unseen concepts without requiring extra supervision. The method offers a practical path to more faithful compositional generation in T2I diffusion models, with implications for broader applicability and reliable multi-object synthesis. However, polysemy and language understanding remain challenges, suggesting future integration with larger language models for disambiguation.
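The two objectives described above can be illustrated with a small sketch. It is not the paper's exact formulation: it assumes each object's cross-attention map is a flattened list of scores in [0, 1], uses a soft intersection-over-union as a stand-in for the Separate loss, and uses one minus the peak activation as a stand-in for the Enhance loss. The function names and the `weight` parameter are hypothetical.

```python
from itertools import combinations


def separate_loss(maps):
    """Mean pairwise soft IoU between object attention maps (assumed form)."""
    pairs = list(combinations(maps, 2))
    total = 0.0
    for a, b in pairs:
        inter = sum(min(x, y) for x, y in zip(a, b))  # soft intersection
        union = sum(max(x, y) for x, y in zip(a, b))  # soft union
        total += inter / union if union > 0 else 0.0
    return total / len(pairs) if pairs else 0.0


def enhance_loss(maps):
    """Mean over objects of (1 - peak activation) (assumed form)."""
    return sum(1.0 - max(m) for m in maps) / len(maps)


def total_loss(maps, weight=1.0):
    """Combine both terms; `weight` is a hypothetical balancing factor."""
    return separate_loss(maps) + weight * enhance_loss(maps)


# Toy 4-pixel maps for two objects, purely illustrative.
overlapping = [[0.9, 0.8, 0.0, 0.0], [0.7, 0.2, 0.0, 0.0]]
separated = [[0.9, 0.8, 0.0, 0.0], [0.0, 0.0, 0.9, 0.7]]

print(separate_loss(overlapping), separate_loss(separated))
print(enhance_loss(overlapping), enhance_loss(separated))
```

In this toy case the separated maps score lower on both terms, which is the direction finetuning would push the attention maps.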

Abstract

Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. While previous research efforts have individually tackled these issues, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacities and broader applicability.
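Finetuning only the critical parameters amounts to choosing which weights stay trainable. The sketch below picks out the cross-attention key projections by name. The tags `attn2` and `to_k` and the example parameter names are assumptions modeled on a common naming convention for Stable Diffusion UNets, not names taken from the paper, and `select_trainable` is a hypothetical helper.

```python
def select_trainable(param_names, cross_attn_tag="attn2", key_tag="to_k"):
    """Return the names that belong to cross-attention key projections.

    Both tags are assumptions about the naming scheme; adjust them to
    match the model actually being finetuned.
    """
    selected = []
    for name in param_names:
        parts = name.split(".")
        if cross_attn_tag in parts and key_tag in parts:
            selected.append(name)
    return selected


# Hypothetical parameter names, for illustration only.
names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",
    "conv_in.weight",
]

for name in select_trainable(names):
    print(name)
```

In a training framework, every parameter outside the returned list would be frozen, so the optimizer only updates a small fraction of the network.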

Paper Structure

This paper contains 22 sections, 7 equations, 20 figures, and 8 tables.

Figures (20)

  • Figure 1: Visual comparisons between Stable Diffusion (Rombach et al., 2022) and our method. Left: compositional finetuning with individual concepts. After finetuning, the model generates high-quality images that are better aligned with the text input. Right: joint compositional finetuning with a large collection of concepts. After finetuning, the model retains a high compositional capacity for unseen novel concepts.
  • Figure 2: (Top) Failure cases of Stable Diffusion (Rombach et al., 2022) and (Bottom) the reasons behind them. Even state-of-the-art T2I models struggle to represent multiple objects with varying attributes. Two primary factors, demonstrated by the bottom two examples respectively, are (1) low attention activation scores for certain objects and (2) overlapping attention masks.
  • Figure 3: Example of before and after applying our separate loss $\mathcal{L}_\mathrm{Sep}$. Left: attention maps from SD. Note that the bottle attention falls onto the bowl, so the bottle fails to be generated. Right: attention maps from our model, where the attention of the bottle and the bowl is better separated so that each concept is properly generated.
  • Figure 4: Example of before and after applying our enhance loss $\mathcal{L}_\mathrm{En}$. Left: attention maps from SD. The attention score of the mouse is lower than that of the cat in the bottom-right region, resulting in a cat-like mouse. Right: attention maps from our model, where the attention activation score of the mouse is enhanced so that it is correctly generated.
  • Figure 5: Average parameter changes for the whole network (top) and inside cross-attention modules (bottom) during finetuning. The parameters in the cross-attention modules are more sensitive to finetuning, especially for the key mapping functions.
  • ...and 15 more figures
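
The measurement behind Figure 5 can be sketched as a mean absolute parameter change computed per parameter group. This is an assumed reading of "average parameter changes", not the paper's stated procedure. The helper names, the grouping rule, and the toy values are hypothetical and do not reproduce any reported result.

```python
def mean_abs_change(before, after, names):
    """Average |after - before| over every scalar in the named parameters."""
    total, count = 0.0, 0
    for name in names:
        old, new = before[name], after[name]
        total += sum(abs(n - o) for o, n in zip(old, new))
        count += len(old)
    return total / count if count else 0.0


def change_by_group(before, after, group_of):
    """Group parameter names with `group_of`, then average within each group."""
    groups = {}
    for name in before:
        groups.setdefault(group_of(name), []).append(name)
    return {g: mean_abs_change(before, after, ns) for g, ns in groups.items()}


# Toy flattened weights before and after finetuning (made-up values).
before = {"attn2.to_k.weight": [0.0, 0.0], "conv_in.weight": [1.0, 1.0]}
after = {"attn2.to_k.weight": [0.5, -0.5], "conv_in.weight": [1.0, 1.25]}


def group_of(name):
    # Hypothetical rule: key projections versus everything else.
    return "key" if ".to_k." in name else "other"


print(change_by_group(before, after, group_of))
```

A larger average change in one group is the kind of signal that would mark it as more sensitive to finetuning.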