Compositional Text-to-Image Generation with Dense Blob Representations
Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat
TL;DR
This work introduces dense blob representations as object-level grounding inputs for compositional text-to-image generation. Each blob pairs a parametric ellipse $[c_x,c_y,a,b,\theta]$ (center, semi-axes, and orientation) with a descriptive caption. BlobGEN, a blob-grounded diffusion model, uses a masked cross-attention mechanism to confine each blob's influence to its local region, and a two-pathway approach generates blob parameters and blob descriptions from text prompts via in-context learning with LLMs. In experiments on MS-COCO and NSR-1K, BlobGEN achieves superior zero-shot generation quality and layout-guided controllability, and ablations highlight the importance of masking and of prompt-tuned blob descriptions. The approach supports local editing of individual blobs, improves compositional reasoning, and lets LLMs plan layouts, giving a modular framework for controllable image synthesis.
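To make the blob parameterization concrete, the following is a minimal sketch of how one blob's ellipse $[c_x,c_y,a,b,\theta]$ could be rasterized into a binary region mask. It is an illustration, not the authors' implementation: the `Blob` class and `blob_mask` function are hypothetical names, and the sketch assumes coordinates normalized to $[0,1]$ and $\theta$ given in radians.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Blob:
    """Hypothetical container for one blob: ellipse parameters plus a caption."""
    cx: float      # ellipse center, x (assumed normalized to [0, 1])
    cy: float      # ellipse center, y (assumed normalized to [0, 1])
    a: float       # semi-axis along the ellipse's own x direction
    b: float       # semi-axis along the ellipse's own y direction
    theta: float   # orientation in radians (assumption)
    caption: str   # free-form description of the object


def blob_mask(blob: Blob, height: int, width: int) -> np.ndarray:
    """Return a (height, width) boolean mask that is True inside the ellipse."""
    ys, xs = np.mgrid[0:height, 0:width]
    # Pixel centers in normalized coordinates, shifted to the ellipse center.
    x = (xs + 0.5) / width - blob.cx
    y = (ys + 0.5) / height - blob.cy
    # Rotate into the ellipse's own frame.
    cos_t, sin_t = np.cos(blob.theta), np.sin(blob.theta)
    u = x * cos_t + y * sin_t
    v = -x * sin_t + y * cos_t
    # Standard ellipse inequality.
    return (u / blob.a) ** 2 + (v / blob.b) ** 2 <= 1.0


cat = Blob(cx=0.5, cy=0.5, a=0.25, b=0.125, theta=0.0, caption="a sleeping cat")
mask = blob_mask(cat, height=64, width=64)
```

A mask of this kind is what a masked attention layer would need in order to restrict a blob's caption to the pixels (or latent positions) the blob covers.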
Abstract
Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives, denoted as dense blob representations, that contain fine-grained details of the scene while being modular, human-interpretable, and easy to construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. In particular, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.
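The idea behind the masked cross-attention module can be illustrated with a small NumPy sketch. It rests on assumptions not stated in the abstract: image positions act as queries, blob embeddings act as keys and values, and each position may attend only to blobs whose region covers it. The function name `masked_cross_attention` and all shapes are hypothetical, and learned projection layers are omitted.

```python
import numpy as np


def masked_cross_attention(queries, keys, values, region_masks):
    """Single-head cross-attention restricted by per-blob region masks.

    queries:      (P, d)   features at P image positions
    keys:         (K, d)   one key per blob
    values:       (K, d_v) one value per blob
    region_masks: (K, P)   True where blob k covers position p
    Returns (P, d_v); positions covered by no blob receive a zero vector.
    """
    d = queries.shape[-1]
    allowed = region_masks.T                       # (P, K)
    scores = queries @ keys.T / np.sqrt(d)         # (P, K)
    scores = np.where(allowed, scores, -np.inf)    # block blobs outside their region

    covered = allowed.any(axis=1, keepdims=True)   # (P, 1)
    # Subtract the row maximum for numerical stability; use 0 for uncovered rows.
    row_max = np.where(covered, scores.max(axis=1, keepdims=True), 0.0)
    weights = np.exp(scores - row_max)             # masked entries become exactly 0
    denom = weights.sum(axis=1, keepdims=True)
    weights = weights / np.where(covered, denom, 1.0)
    return weights @ values


rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 3))                  # 4 positions, feature dim 3
keys = rng.normal(size=(2, 3))                     # 2 blobs
values = np.array([[1.0, 0.0], [0.0, 1.0]])        # one-hot values for readability
region_masks = np.array([[True, True, False, False],
                         [False, True, True, False]])
out = masked_cross_attention(queries, keys, values, region_masks)
```

With one-hot values, each output row shows how much each blob contributes at that position: a position covered by a single blob receives only that blob's value, and a position covered by both receives a mixture.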
