A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models

Jan-Hendrik Koch, Jonas Krumme, Konrad Gadzicki

TL;DR

The paper tackles the lack of precise compositional control in diffusion-based image synthesis with a two-stage system: an LLM first generates a structured layout from a list of objects, and a layout-conditioned diffusion model then renders a photorealistic image from that layout. A key finding is that LLMs omit many objects when asked to produce a complex layout directly. Task decomposition mitigates this: the LLM generates only the core objects, and the remaining objects are added by rule-based insertion, which raises object recall from 57.2% to 99.9% in dense scenes. The study compares ControlNet and GLIGEN as layout-conditioning methods after domain-specific finetuning. ControlNet retains text-based styling but sometimes hallucinates extra objects, while GLIGEN follows the layout more faithfully but becomes less controllable through the prompt. Overall, the decoupled approach generates images with the specified object counts and plausible spatial arrangements, which suggests a practical route to compositionally constrained synthesis.

Abstract

Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability. Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.
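The recall figures quoted in the abstract can be made concrete with a small sketch. The paper's exact metric is not given here, so the definition below is an assumption: recall is the fraction of requested object instances that also appear in the generated layout, counted per class. The function name `object_recall` and the example lists are hypothetical.

```python
from collections import Counter


def object_recall(requested, generated):
    """Fraction of requested object instances found in the generated layout.

    Assumed definition (not taken from the paper): instances are matched
    per class label, and surplus generated objects do not raise the score.
    """
    want = Counter(requested)
    have = Counter(generated)
    matched = sum(min(count, have[label]) for label, count in want.items())
    total = sum(want.values())
    return matched / total if total else 1.0


# Hypothetical request for a two-person table and an incomplete LLM layout.
requested = ["plate", "plate", "fork", "fork", "knife", "knife"]
generated = ["plate", "plate", "fork", "knife"]
print(f"recall = {object_recall(requested, generated):.3f}")
```

Under this definition, a layout that drops one fork and one knife out of six requested objects scores 4/6, which illustrates how omissions in dense scenes pull recall down.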

Paper Structure

This paper contains 30 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Image generated with Stable Diffusion 1.5 (Rombach et al., 2022) from the text prompt "A laid table with 4 plates". Stable Diffusion does not count the plates correctly, and their placement is not useful.
  • Figure 2: Example of (a) a layout converted to a segmentation map and (b) a real segmentation map from the dataset. (Images cropped.)
  • Figure 3: Layout generation with task decomposition and rule-based completion. (a) Simplified layouts generated using the plates-only and place settings-only approaches, containing 4-8 core objects per layout. The object list is shown at the bottom of each layout. (b) Final layouts after rule-based completion, where auxiliary objects (fork, knife, spoon, bowl, and two glasses per place setting) were inserted according to table-setting conventions. Generated objects are shown in black, while inserted objects are shown in red. This two-stage approach achieves 99.9% recall compared to 57.2% for direct generation of complete layouts.
  • Figure 4: Comparative results of layout-conditioned image synthesis using finetuned models. (a) Images generated by ControlNet conditioned on a segmentation map, showing results from the original pre-trained model versus checkpoints from our domain-specific finetuning. (b) Images generated by GLIGEN conditioned on a bounding box layout, comparing results from the original pre-trained model with our finetuned checkpoints. While both methods improve rendering quality with finetuning, ControlNet is prone to hallucinating extra objects, whereas GLIGEN demonstrates superior layout fidelity but reduced text-based stylistic control.
  • Figure 5: End-to-end results of the complete system for (a) two-person and (b) four-person table settings. For each setting, a layout was first generated using the plates-only approach with rule-based completion. The resulting layout was then used to generate images with both finetuned ControlNet and GLIGEN. The system successfully produces images with correct object counts and plausible spatial arrangements, demonstrating the viability of the two-stage approach for compositionally controlled synthesis.
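
The captions for Figures 2 and 3 describe two mechanical steps: inserting auxiliary objects around each LLM-generated plate, and converting the finished layout into a segmentation map for ControlNet. The sketch below illustrates both under stated assumptions. The offsets in `AUX_RULES`, the class ids, the `Box` type, and the axis-aligned rectangles are all hypothetical and are not taken from the paper. It also ignores the orientation of each place setting, which real table-setting conventions would need.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    cx: float  # centre x, normalised to [0, 1]
    cy: float  # centre y, normalised to [0, 1]
    w: float   # width, normalised
    h: float   # height, normalised


# Hypothetical rules: (label, dx, dy, w, h), all in units of the plate width
# and relative to the plate centre. One fork, knife, spoon, bowl, two glasses.
AUX_RULES = [
    ("fork", -0.75, 0.0, 0.15, 0.9),
    ("knife", 0.75, 0.0, 0.15, 0.9),
    ("spoon", 0.95, 0.0, 0.15, 0.9),
    ("bowl", 0.0, 0.0, 0.6, 0.6),
    ("glass", 0.6, -0.8, 0.3, 0.3),
    ("glass", 0.95, -0.8, 0.3, 0.3),
]

# Hypothetical class ids for the segmentation map; 0 is background.
CLASS_IDS = {"plate": 1, "fork": 2, "knife": 3, "spoon": 4, "bowl": 5, "glass": 6}


def complete_layout(plates):
    """Append the auxiliary objects for every plate to the core layout."""
    layout = list(plates)
    for p in plates:
        for label, dx, dy, w, h in AUX_RULES:
            layout.append(
                Box(label, p.cx + dx * p.w, p.cy + dy * p.w, w * p.w, h * p.w)
            )
    return layout


def rasterize(layout, size=64):
    """Paint each box as a filled rectangle of its class id; later boxes win."""
    seg = [[0] * size for _ in range(size)]
    for b in layout:
        x0 = max(0, int((b.cx - b.w / 2) * size))
        x1 = min(size, int((b.cx + b.w / 2) * size))
        y0 = max(0, int((b.cy - b.h / 2) * size))
        y1 = min(size, int((b.cy + b.h / 2) * size))
        for y in range(y0, y1):
            for x in range(x0, x1):
                seg[y][x] = CLASS_IDS[b.label]
    return seg


# Plates-only layout for two people, as an LLM might return it (made-up values).
plates = [Box("plate", 0.3, 0.5, 0.2, 0.2), Box("plate", 0.7, 0.5, 0.2, 0.2)]
layout = complete_layout(plates)
seg = rasterize(layout)
print(Counter(b.label for b in layout))
```

Because the inserted objects are appended after the plates, the bowl is painted on top of its plate in the segmentation map. The same completed layout could be passed to GLIGEN directly as a list of labelled bounding boxes, without the rasterization step.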