Compositional Text-to-Image Generation with Dense Blob Representations

Weili Nie; Sifei Liu; Morteza Mardani; Chao Liu; Benjamin Eckart; Arash Vahdat

TL;DR

This work introduces dense blob representations as object-level grounding inputs for compositional text-to-image generation, where each blob carries a parametric ellipse $[c_x,c_y,a,b,\theta]$ and a descriptive caption. BlobGEN, a blob-grounded diffusion model, uses a novel masked cross-attention mechanism to confine each blob's influence to its local region and employs a two-pathway approach to generate blob parameters and descriptions from text prompts via in-context learning with LLMs. Through extensive MS-COCO and NSR-1K experiments, BlobGEN achieves state-of-the-art zero-shot generation quality and superior layout-guided controllability, with ablations highlighting the importance of masking and prompt-tuned blob descriptions. The approach enables precise local editing, robust compositional reasoning, and stronger integration with LLMs for planning layouts, offering a modular and scalable framework for controllable image synthesis that can be extended to broader grounding modalities and prompts.
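To make the blob parameterization concrete, here is a minimal sketch of turning one ellipse $[c_x,c_y,a,b,\theta]$ into a binary region mask. It assumes pixel coordinates, semi-axis lengths $a$ and $b$, and $\theta$ in radians. The function name `blob_mask` and the grid size are hypothetical and not taken from the paper, which may normalize or orient the parameters differently.

```python
import math

def blob_mask(cx, cy, a, b, theta, height, width):
    """Rasterize a tilted ellipse into a binary mask (list of rows of 0/1).

    Assumption: (cx, cy) is the center in pixels, a and b are the semi-axis
    lengths, and theta is the rotation of the a-axis in radians.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Rotate the offset into the ellipse's own axis frame.
            u = dx * cos_t + dy * sin_t
            v = -dx * sin_t + dy * cos_t
            # Standard ellipse membership test in that frame.
            inside = (u / a) ** 2 + (v / b) ** 2 <= 1.0
            row.append(1 if inside else 0)
        mask.append(row)
    return mask

# A wide, flat blob centered on a 9x9 grid, and the same blob rotated by 90 degrees.
flat = blob_mask(cx=4, cy=4, a=3, b=1.5, theta=0.0, height=9, width=9)
tall = blob_mask(cx=4, cy=4, a=3, b=1.5, theta=math.pi / 2, height=9, width=9)
```

Masks like these mark the local region in which a blob is allowed to influence the image. A rasterized ellipse is one simple way to obtain such a region from the five parameters.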

Abstract

Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives, denoted as dense blob representations, that contain fine-grained details of the scene while being modular, human-interpretable, and easy to construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. In particular, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.
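The idea behind masked cross-attention can be illustrated with a small, dependency-free sketch in which each visual feature (query) attends only to the blob embeddings (keys and values) whose region covers it. This is one generic way to mask attention and is not claimed to be BlobGEN's exact formulation. The function name, the toy tensors, and the choice to give uncovered pixels a zero update are all assumptions made for illustration.

```python
import math

def masked_cross_attention(queries, keys, values, mask):
    """Toy masked cross-attention over plain Python lists.

    queries: N x d pixel features; keys and values: M x d blob embeddings.
    mask[i][j] is 1 if pixel i lies inside blob j, else 0.
    """
    d = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for q, allowed in zip(queries, mask):
        # Scaled dot-product scores, kept only for blobs covering this pixel.
        scores = {
            j: sum(qi * ki for qi, ki in zip(q, keys[j])) / math.sqrt(d)
            for j, ok in enumerate(allowed) if ok
        }
        if not scores:
            # Assumption: a pixel covered by no blob receives a zero update.
            outputs.append([0.0] * out_dim)
            continue
        # Numerically stable softmax restricted to the allowed blobs.
        top = max(scores.values())
        weights = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(weights.values())
        outputs.append([
            sum(w / total * values[j][c] for j, w in weights.items())
            for c in range(out_dim)
        ])
    return outputs

# Three pixels, two blobs: pixel 0 is inside blob 0 only, pixel 1 is inside
# both blobs, and pixel 2 is inside neither.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 20.0]]
mask = [[1, 0], [1, 1], [0, 0]]
out = masked_cross_attention(queries, keys, values, mask)
```

Because pixel 0 can see only blob 0, its output is exactly that blob's value vector, so the other blob's description cannot leak into that region. In a real model the mask would come from the blob ellipses and the computation would run on batched tensors.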

Paper Structure (59 sections, 4 equations, 17 figures, 7 tables)

Introduction
Method
Image Decomposition into Blob Representations
Blob-grounded Text-to-Image Generation
Blob Embedding.
Masked Cross-Attention.
Other Design Choices.
LLMs for Blob Generation
Blob Parameter Generation.
Blob Description Generation.
Related Work
Text-to-Image Generation.
Compositional Image Generation.
LLM-augmented Image Generation.
Experiments
...and 44 more sections

Figures (17)

Figure 1: Generated images from blob representations can reconstruct fine-grained details of real images. Each row shows the real image (Left), blobs (Middle), and two randomly generated samples (Right). We do not show blob descriptions for simplicity.
Figure 2: (a) We extract blob representations (parameters and descriptions) using existing tools to guide the text-to-image diffusion model. (b) Our model leverages a novel masked cross-attention module that allows visual features to attend to only corresponding blobs.
Figure 3: Zero-shot layout-grounded generation results of GLIGEN and our method on the MS-COCO validation set. In each row, we visualize the reference real image (Left), bounding boxes and the GLIGEN-generated image (Middle), and blobs and our generated image (Right). All images have a resolution of 512$\times$512.
Figure 4: Various image editing results of our method on the MS-COCO validation set, where each example contains two generated images: (Left) original setting and (Right) after editing. The top row shows local editing results, where we change only the blob description; because the blob parameters stay the same after editing, we do not show blob visualizations. The bottom two rows show object repositioning results, where we change only the blob parameters. All images have a resolution of 512$\times$512.
Figure 5: Qualitative results of our method on two compositional generation tasks of NSR-1K (Feng et al., 2023): (a) spatial reasoning and (b) numerical reasoning. Given a caption, we prompt GPT4 to generate blob parameters (Left) and LLAMA-13B to generate blob descriptions (not shown in the figure), which are passed to our blob-grounded text-to-image model to synthesize an image (Right).
...and 12 more figures
