Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control

Denis Lukovnikov, Asja Fischer

TL;DR

This work shows the limitations of ControlNet for the layout-to-image task and enables it to use localized descriptions through a training-free approach that modifies the cross-attention scores during generation. It also develops a novel cross-attention manipulation method that maintains image quality while improving control.

Abstract

While text-to-image diffusion models can generate high-quality images from textual descriptions, they generally lack fine-grained control over the visual composition of the generated images. Some recent works tackle this problem by training the model to condition the generation process on additional input describing the desired image layout. Arguably the most popular among such methods, ControlNet, enables a high degree of control over the generated image using various types of conditioning inputs (e.g. segmentation maps). However, it still lacks the ability to take into account localized textual descriptions that indicate which image region is described by which phrase in the prompt. In this work, we show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions using a training-free approach that modifies the cross-attention scores during generation. We adapt and investigate several existing cross-attention control methods in the context of ControlNet and identify shortcomings that cause failure (concept bleeding) or image degradation under specific conditions. To address these shortcomings, we develop a novel cross-attention manipulation method that maintains image quality while improving control. We present qualitative and quantitative experimental studies focusing on challenging cases, which demonstrate the effectiveness of the investigated general approach and show the improvements obtained by the proposed cross-attention control method.

Paper Structure (41 sections, 10 equations, 27 figures, 3 tables)

Figures (27)

  • Figure 1: An example of the task. The input consists of masks (a)-(d) and the annotated prompt in the caption of (e). The desired output is shown in (e). See \ref{sec:task}.
  • Figure 2: A diagram illustrating attention redistribution and attention boosting with the running example.
  • Figure 3: Layouts used for qualitative comparison throughout this paper (the first three layouts are used in Fig. \ref{fig:qualitative}, the last layout in Fig. \ref{fig:complexshapes}).
  • Figure 4: A qualitative comparison of different cross-attention control methods in ControlNet-extended Stable Diffusion. Results for multiple seeds are shown to illustrate how consistent the generation results are. See Fig. \ref{fig:layouts} for layout specification.
  • Figure 5: Image quality with increasing control strength for eDiff-I and DenseDiffusion cross-attention control with ControlNet.
  • ...and 22 more figures