LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka

TL;DR

Diffusion-based text-to-image models struggle with lengthy, detailed prompts. The paper addresses this with LLM-generated Scene Blueprints, which decompose a prompt into bounding boxes for foreground objects, per-object descriptions, and a concise background prompt. Generation then proceeds in two phases: a Global Scene Generation stage produces an initial image from the layout and background, and an Iterative Refinement Scheme then refines the content of each box using CLIP-guided diffusion with generated reference images, so that each object better matches its description. In experiments on complex multi-object prompts and in a user study, the approach achieves higher prompt adherence and image fidelity than baselines, and ablations show the value of layout interpolation and multi-modal guidance. By combining LLM reasoning, layout manipulation, and targeted per-object refinement, the work advances open-set, long-prompt image synthesis, with practical implications for content creation.

Abstract

Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align it with the corresponding textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs.
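
The three components the abstract lists (bounding boxes, per-object descriptions, and a background context) can be pictured as a small data structure. The following is a minimal sketch, not the authors' code: the names `SceneObject`, `SceneBlueprint`, and `parse_blueprint`, the dictionary keys, and the normalized `(x_min, y_min, x_max, y_max)` box convention are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class SceneObject:
    name: str          # short label for the foreground object
    description: str   # detailed per-object text description
    box: Box           # bounding box proposed for the object


@dataclass
class SceneBlueprint:
    background: str    # succinct background context
    objects: List[SceneObject] = field(default_factory=list)


def parse_blueprint(raw: Dict[str, Any]) -> SceneBlueprint:
    """Build a blueprint from a dict, e.g. decoded from an LLM's JSON reply.

    The key names ("background", "objects", "name", "description", "box")
    are hypothetical; the source does not specify an output format.
    """
    objects = []
    for item in raw.get("objects", []):
        x0, y0, x1, y1 = (float(v) for v in item["box"])
        # Reject boxes that are empty, inverted, or outside the image.
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid box for {item['name']!r}: {item['box']}")
        objects.append(
            SceneObject(item["name"], item["description"], (x0, y0, x1, y1))
        )
    return SceneBlueprint(background=raw["background"], objects=objects)


# Illustrative input only; the values are made up.
example = {
    "background": "a cozy living room",
    "objects": [
        {"name": "cat", "description": "a white cat", "box": [0.1, 0.5, 0.4, 0.9]},
        {"name": "sofa", "description": "a green sofa", "box": [0.3, 0.4, 0.9, 0.8]},
    ],
}
blueprint = parse_blueprint(example)
```

Validating the boxes at parse time is a design choice of this sketch: a malformed LLM reply fails early instead of propagating into the layout-to-image stage.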

Paper Structure

This paper contains 19 sections, 8 equations, 13 figures, 2 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Current state-of-the-art text-to-image models (Columns 1-4) face challenges when dealing with lengthy and detailed text prompts, resulting in the exclusion of objects and fine-grained details. Our approach (Column 5) adeptly encompasses all the objects described, preserving their intricate features and spatial characteristics as outlined in the two white boxes.
  • Figure 2: Global Scene Generation: Our proposed approach takes a long text prompt describing a complex scene and leverages an LLM to generate $k$ layouts which are then interpolated to a single layout, ensuring the spatial accuracy of object placement. Along with the layouts, we also query an LLM to generate object descriptions along with a concise background prompt summarizing the scene's essence. A Layout-to-Image model is employed which transforms the layout into an initial image. Iterative Refinement Scheme: The content of each box proposal is refined using a diffusion model conditioned on a box mask, a (generated) reference image for the box, and the source image, guided by a multi-modal signal.
  • Figure 3: Effect of interpolation factor $\eta$: We interpolate the $k$ bounding boxes for each object and control the interpolation by the factor $\eta$. We visualize the change in the bounding box location of "a white cat" highlighted in the text for different $\eta$ values from 0.1 to 0.9 with increments of 0.1. Best viewed in zoom.
  • Figure 4: User study. A majority of users picked our method compared to prior works when presented with a 2-AFC task of selecting the image that adheres to the given prompt the most.
  • Figure 5: Qualitative comparisons: We compare our image generation method to state-of-the-art baselines, including those using layouts. The underlined text in the text prompts represents the objects, their characteristics, and spatial properties. Red text highlights missing objects, purple signifies inaccuracies in object positioning, and black text points out implausible or deformed elements. Baseline methods often omit objects and struggle with spatial accuracy (first four columns), while our approach excels in capturing all objects and preserving spatial attributes (last column).
  • ...and 8 more figures
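
Figures 2 and 3 describe interpolating the $k$ LLM-generated bounding boxes of each object into a single box, controlled by a factor $\eta$. The captions do not give the formula, so the sketch below assumes one plausible reading: a running linear interpolation in which each successive candidate is blended into the current result with weight $\eta$. The function names `lerp_box` and `interpolate_boxes` and the box convention are hypothetical, and the paper's actual rule may differ.

```python
from typing import Sequence, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def lerp_box(a: Box, b: Box, eta: float) -> Box:
    """Linearly interpolate each coordinate: (1 - eta) * a + eta * b."""
    return tuple((1.0 - eta) * ai + eta * bi for ai, bi in zip(a, b))


def interpolate_boxes(candidates: Sequence[Box], eta: float) -> Box:
    """Fold k candidate boxes for one object into a single box.

    Assumption (not from the source): start from the first candidate and
    blend in each later candidate with weight eta.
    """
    if not candidates:
        raise ValueError("need at least one candidate box")
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    result = candidates[0]
    for box in candidates[1:]:
        result = lerp_box(result, box, eta)
    return result


# Sweep eta from 0.1 to 0.9 in steps of 0.1, mirroring the range in Figure 3.
# The candidate boxes are made-up values for illustration.
cat_candidates = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
sweep = {i / 10: interpolate_boxes(cat_candidates, i / 10) for i in range(1, 10)}
```

Under this assumed rule, $\eta = 0$ keeps the first candidate and $\eta = 1$ keeps the last, so intermediate values move the box smoothly between the proposals.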