Compositional Text-to-Image Generation with Dense Blob Representations

Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat

TL;DR

This work introduces dense blob representations as object-level grounding inputs for compositional text-to-image generation, where each blob carries a parametric ellipse $[c_x,c_y,a,b,\theta]$ and a descriptive caption. BlobGEN, a blob-grounded diffusion model, uses a novel masked cross-attention mechanism to confine each blob's influence to its local region and employs a two-pathway approach to generate blob parameters and descriptions from text prompts via in-context learning with LLMs. Through extensive MS-COCO and NSR-1K experiments, BlobGEN achieves state-of-the-art zero-shot generation quality and superior layout-guided controllability, with ablations highlighting the importance of masking and prompt-tuned blob descriptions. The approach enables precise local editing, robust compositional reasoning, and stronger integration with LLMs for planning layouts, offering a modular and scalable framework for controllable image synthesis that can be extended to broader grounding modalities and prompts.
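To make the ellipse parameterization concrete, the following is a minimal sketch of how a blob $[c_x,c_y,a,b,\theta]$ could be rasterized into a binary region mask. It is illustrative only and not taken from the paper's code: the function name `blob_mask`, the use of coordinates normalized to $[0,1]$, the reading of $a$ and $b$ as semi-axis lengths, and $\theta$ in radians are all assumptions.

```python
import math


def blob_mask(cx, cy, a, b, theta, height, width):
    """Rasterize one blob ellipse into a height x width binary mask.

    Assumptions (not confirmed by the source): cx, cy, a, b are normalized
    to [0, 1], a and b are semi-axis lengths, theta is in radians.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    mask = []
    for row in range(height):
        # Pixel-center y coordinate, shifted so the ellipse center is the origin.
        y = (row + 0.5) / height - cy
        line = []
        for col in range(width):
            x = (col + 0.5) / width - cx
            # Rotate the offset into the ellipse's own axis-aligned frame.
            u = x * cos_t + y * sin_t
            v = -x * sin_t + y * cos_t
            # Standard ellipse membership test.
            inside = (u / a) ** 2 + (v / b) ** 2 <= 1.0
            line.append(1 if inside else 0)
        mask.append(line)
    return mask


if __name__ == "__main__":
    # A wide, unrotated blob in the middle of an 8 x 8 grid.
    for line in blob_mask(0.5, 0.5, 0.25, 0.125, 0.0, 8, 8):
        print("".join("#" if v else "." for v in line))
```

A mask of this kind is what lets a blob's influence be restricted to its own region of the image.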

Abstract

Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives, denoted as dense blob representations, that contain fine-grained details of the scene while being modular, human-interpretable, and easy to construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. In particular, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.
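The idea of masked cross-attention, where each visual feature attends only to the blobs that cover it, can be sketched in a few lines. This is a hedged, dependency-free illustration rather than the authors' module: the function name `masked_cross_attention`, the single-head scaled dot-product form, and the choice to return a zero vector for locations covered by no blob are assumptions made for the example.

```python
import math


def masked_cross_attention(queries, keys, values, masks):
    """Single-head cross-attention restricted by per-blob region masks.

    queries: N visual feature vectors of dimension d (one per location).
    keys:    M blob embedding vectors of dimension d.
    values:  M blob embedding vectors of dimension dv.
    masks:   masks[j][i] is 1 if location i lies inside blob j, else 0.

    Assumption: a location covered by no blob receives a zero vector.
    """
    d = len(queries[0])
    dv = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Only blobs whose region contains this location may be attended to.
        allowed = [j for j in range(len(keys)) if masks[j][i]]
        if not allowed:
            outputs.append([0.0] * dv)
            continue
        scores = [
            sum(qa * ka for qa, ka in zip(q, keys[j])) / math.sqrt(d)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed blobs only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[j][c] for w, j in zip(weights, allowed))
            for c in range(dv)
        ])
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 0.0], [0.0, 1.0]]
    m = [[1, 0, 0], [0, 1, 0]]  # blob 0 covers location 0, blob 1 covers location 1
    print(masked_cross_attention(q, k, v, m))
```

Because the softmax is taken over the allowed blobs only, a blob's description cannot leak into locations outside its region, which is the disentangling effect the abstract describes.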

Paper Structure (59 sections, 4 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: Generated images from blob representations can reconstruct fine-grained details of real images. Each row shows the real image (Left), blobs (Middle), and two randomly generated samples (Right). We do not show blob descriptions for simplicity.
  • Figure 2: (a) We extract blob representations (parameters and descriptions) using existing tools to guide the text-to-image diffusion model. (b) Our model leverages a novel masked cross-attention module that allows visual features to attend to only corresponding blobs.
  • Figure 3: Zero-shot layout-grounded generation results of GLIGEN and our method on the MS-COCO validation set. In each row, we visualize the reference real image (Left), bounding boxes and the GLIGEN-generated image (Middle), and blobs and our generated image (Right). All images are at a resolution of 512$\times$512.
  • Figure 4: Various image editing results of our method on the MS-COCO validation set, where each example contains two generated images: (Left) original setting and (Right) after editing. The top row shows the local editing results, where we only change the blob description. Since the blob parameters stay the same after editing, we do not show blob visualizations. The bottom two rows show the object repositioning results, where we only change the blob parameters. All images are at a resolution of 512$\times$512.
  • Figure 5: Qualitative results of our method on two compositional generation tasks of NSR-1K (Feng et al., 2023): (a) spatial reasoning and (b) numerical reasoning. Given a caption, we prompt GPT-4 to generate blob parameters (Left) and LLAMA-13B to generate blob descriptions (not shown in the figure), which are passed to our blob-grounded text-to-image model to synthesize an image (Right).
  • ...and 12 more figures