Localized Text-to-Image Generation for Free via Cross Attention Control

Yutong He; Ruslan Salakhutdinov; J. Zico Kolter

TL;DR

This work addresses localized text-to-image generation, which has so far required either additional training or substantially more inference time. It introduces Cross Attention Control (CAC), a training-free plugin that manipulates cross-attention maps using localization prompts and masks to place content at specified image locations. The authors provide a standardized automatic evaluation suite and show that CAC improves localization with bounding boxes, semantic segmentation maps, and compositional prompts for base models such as Stable Diffusion, MultiDiffusion, and GLIGEN. The approach broadens access to controllable, open-vocabulary generation; the authors also acknowledge its potential risks and limitations and propose safeguards for responsible deployment.

Abstract

Despite the tremendous success of text-to-image generative models, localized text-to-image generation (that is, generating objects or features at specific locations in an image while maintaining a consistent overall generation) still requires either explicit training or substantial additional inference time. In this work, we show that localized generation can be achieved by simply controlling cross attention maps during inference. With no additional training, changes to the model architecture, or extra inference time, our proposed cross attention control (CAC) provides new open-vocabulary localization abilities to standard text-to-image models. CAC also enhances models that are already trained for localized generation when deployed at inference time. Furthermore, to assess localized text-to-image generation performance automatically, we develop a standardized suite of evaluations using large pretrained recognition models. Our experiments show that CAC improves localized generation performance with various types of location information ranging from bounding boxes to semantic segmentation maps, and enhances the compositional capability of state-of-the-art text-to-image generative models.
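To make the idea of "controlling cross attention maps" concrete, the following is a minimal NumPy sketch of one plausible way to blend per-prompt cross-attention outputs using spatial masks. It is an illustration under stated assumptions, not the paper's exact formulation: the function names (`cross_attention`, `masked_cross_attention`), the mask-weighted averaging rule, and the toy shapes are all hypothetical, and a real model would apply this inside each cross-attention layer of the denoising network.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Standard scaled dot-product cross attention.

    q: (P, d) queries, one per spatial location.
    k, v: (T, d) keys/values from the text-token embeddings.
    Returns a (P, d) array.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def masked_cross_attention(q, prompts, masks):
    """Hypothetical blending rule (an assumption, not taken from the paper):
    each prompt's attention output contributes only where its mask is
    nonzero, and overlapping regions are averaged by mask weight.

    prompts: list of (k, v) pairs, one per text prompt.
    masks: list of (P,) arrays with values in [0, 1].
    """
    num = np.zeros((q.shape[0], prompts[0][1].shape[-1]))
    den = np.zeros((q.shape[0], 1))
    for (k, v), m in zip(prompts, masks):
        m = np.asarray(m, dtype=float).reshape(-1, 1)
        num += m * cross_attention(q, k, v)
        den += m
    return num / np.maximum(den, 1e-8)

# Toy example: 16 spatial locations, 5 text tokens, feature width 8.
rng = np.random.default_rng(0)
P, T, d = 16, 5, 8
q = rng.normal(size=(P, d))
glob = (rng.normal(size=(T, d)), rng.normal(size=(T, d)))  # full caption
loc = (rng.normal(size=(T, d)), rng.normal(size=(T, d)))   # localized prompt

mask_global = np.ones(P)   # the caption applies everywhere
mask_local = np.zeros(P)
mask_local[:8] = 1.0       # the localized prompt applies to the first 8 locations

out = masked_cross_attention(q, [glob, loc], [mask_global, mask_local])
```

In this sketch, locations outside the local mask receive only the caption's attention output, while locations inside it receive an equal mix of the caption and the localized prompt. Because the blending reuses the ordinary attention computation, it needs no additional parameters or training, which is the property the abstract emphasizes.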

Paper Structure (37 sections, 6 equations, 14 figures, 2 tables)

Introduction
Related Works
Method
Problem Setup
Text-to-Image Generation with Cross Attention
Cross Attention Control (CAC) for Localized Generation
Incorporating Self Attention Control
Experiments
Baselines
Localized Text-to-Image Generation
Generating with Bounding Boxes
Experiment Setting and Dataset
Evaluation Metrics
Results
Generating with Semantic Segmentation Maps
...and 22 more sections

Figures (14)

Figure 1: CAC as a plugin to existing methods for localized text-to-image generation. CAC improves upon diverse types of localization (bounding boxes, semantic segmentation maps and localized styles) with different base models (Stable Diffusion and GLIGEN).
Figure 2: Illustration of CAC for localized generation. CAC uses localized text descriptions and spatial constraints to manipulate the cross attention maps.
Figure 3: Illustration of generated images based on COCO bounding boxes.
Figure 4: Illustration of different approaches generating images via Cityscapes segmentation maps.
Figure 5: Experiment results with semantic segmentation information.
...and 9 more figures
