LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts
Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka
TL;DR
Diffusion-based text-to-image models struggle with lengthy, detailed prompts. This paper addresses the problem with LLM-generated Scene Blueprints, which decompose a prompt into object bounding boxes, per-object descriptions, and a background description. Generation then proceeds in two phases: a Global Scene Generation stage produces an initial image from the layout, and an Iterative Refinement Scheme refines each object’s content using CLIP-guided diffusion with reference prototypes so that it matches its description. In experiments and a user study, the approach achieves higher prompt adherence and image fidelity than baselines, and ablations show the value of layout interpolation and multi-modal guidance. The work advances open-set, long-prompt image synthesis by combining LLM reasoning, layout manipulation, and targeted per-object refinement. The framework is aimed at more accurate and coherent generation for complex, multi-object prompts, with practical implications for content creation and visual ontologies.
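As an illustration of the decomposition described above, a Scene Blueprint could be represented as a small data structure. The sketch below is not the authors' code: the JSON schema, the field names (`name`, `description`, `box`, `background`), the normalized (x, y, width, height) box convention, and the `parse_blueprint` helper are all assumptions made for this example.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    name: str
    description: str
    # Assumed convention: (x, y, width, height), normalized to [0, 1].
    box: Tuple[float, float, float, float]


@dataclass
class SceneBlueprint:
    background: str
    objects: List[SceneObject]


def parse_blueprint(llm_response: str) -> SceneBlueprint:
    """Parse a (hypothetical) JSON reply from an LLM into a SceneBlueprint."""
    data = json.loads(llm_response)
    objects = []
    for item in data["objects"]:
        x, y, w, h = item["box"]
        # Reject boxes that are empty or fall outside the unit canvas.
        inside = x >= 0 and y >= 0 and w > 0 and h > 0 and x + w <= 1 and y + h <= 1
        if not inside:
            raise ValueError(f"invalid box for {item['name']!r}: {item['box']}")
        objects.append(SceneObject(item["name"], item["description"], (x, y, w, h)))
    return SceneBlueprint(background=data["background"], objects=objects)


# Made-up reply for a two-object prompt, used only to exercise the parser.
example_response = json.dumps({
    "background": "a sunlit living room with wooden floors",
    "objects": [
        {"name": "cat", "description": "a grey tabby cat sleeping",
         "box": [0.1, 0.5, 0.3, 0.3]},
        {"name": "sofa", "description": "a red velvet sofa",
         "box": [0.5, 0.4, 0.4, 0.4]},
    ],
})
blueprint = parse_blueprint(example_response)
print(blueprint.background)
for obj in blueprint.objects:
    print(obj.name, obj.box)
```

Validating the boxes at parse time is a design choice for this sketch: a layout-conditioned generator cannot use a box that lies outside the canvas, so it is cheaper to fail early than to debug a malformed layout later.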
Abstract
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align it with the corresponding textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs.
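The evaluate-and-refine loop described above can be sketched in a few lines of Python. This is only a schematic under stated assumptions: `score_fn` stands in for whatever measures agreement between a box's content and its description (for example, a CLIP-style similarity), `refine_fn` stands in for the per-box regeneration step, and the threshold and round limit are hypothetical values, not ones taken from the paper.

```python
from typing import Callable, Dict


def refine_boxes(
    num_boxes: int,
    score_fn: Callable[[int], float],   # hypothetical: alignment score of box i
    refine_fn: Callable[[int], None],   # hypothetical: regenerate content of box i
    threshold: float = 0.3,             # assumed acceptance threshold
    max_rounds: int = 3,                # assumed cap on refinement rounds
) -> Dict[int, int]:
    """Refine only the boxes whose score is below the threshold.

    Returns the number of refinement passes applied to each box.
    """
    passes = {i: 0 for i in range(num_boxes)}
    for _ in range(max_rounds):
        # Evaluate every box and collect the ones that still miss the target.
        failing = [i for i in passes if score_fn(i) < threshold]
        if not failing:
            break
        for i in failing:
            refine_fn(i)
            passes[i] += 1
    return passes


# Toy stand-ins: scores are stored in a list, and each "refinement"
# simply raises the score of the affected box by a fixed amount.
scores = [0.5, 0.0, 0.25]


def toy_score(i: int) -> float:
    return scores[i]


def toy_refine(i: int) -> None:
    scores[i] += 0.25


passes = refine_boxes(len(scores), toy_score, toy_refine)
print(passes)
```

In the toy run, the first box already meets the threshold and is left untouched, while the other two are refined until their scores clear it. A real system would replace the two stand-in functions with an image-text scoring model and a box-conditioned generation step.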
