A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models

Jan-Hendrik Koch; Jonas Krumme; Konrad Gadzicki

TL;DR

The paper addresses the lack of precise compositional control in diffusion-based image synthesis with a two-stage system: an LLM first generates a structured layout from a list of objects, and a layout-conditioned diffusion model then renders a photorealistic image that follows that layout. A key finding is that LLMs omit many requested objects when they plan a complex layout in a single pass. Task decomposition mitigates this: the LLM places only the core objects and rule-based insertion completes the layout, which raises object recall from 57.2% to 99.9% for complex scenes. The study compares ControlNet and GLIGEN as layout-conditioning methods after finetuning on table-setting datasets. ControlNet preserves text-based stylistic control at the cost of occasional object hallucination, while GLIGEN offers stronger layout fidelity but reduced prompt-based controllability. Overall, the decoupled system generates images with the specified object counts and plausible spatial arrangements, which supports its viability for compositionally controlled synthesis.

Abstract

Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability. Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.
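
The decomposition described above can be illustrated with a minimal sketch: an object-recall metric plus a rule-based completion step that inserts objects missing from a core layout. This is not the authors' implementation. The `Box` type, the `RULES` table, the placement offsets, and the fallback position are hypothetical, and the core layout is hard-coded where the real system would use LLM output.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized image coordinates (hypothetical format)."""
    label: str
    x: float  # center x
    y: float  # center y
    w: float
    h: float


def object_recall(requested, layout):
    """Fraction of requested object instances that appear in the layout."""
    want = Counter(requested)
    have = Counter(box.label for box in layout)
    matched = sum(min(n, have[label]) for label, n in want.items())
    total = sum(want.values())
    return matched / total if total else 1.0


# Hypothetical placement rules: label -> (anchor label, x offset in anchor widths).
RULES = {
    "fork": ("plate", -0.6),   # left of a plate
    "knife": ("plate", 0.6),   # right of a plate
}


def complete_layout(requested, core_layout):
    """Insert every requested object that the core layout is missing."""
    layout = list(core_layout)
    # Counter subtraction keeps only labels with a positive shortfall.
    missing = Counter(requested) - Counter(box.label for box in layout)
    for label, count in missing.items():
        anchor_label, dx = RULES.get(label, (None, 0.0))
        anchors = [box for box in layout if box.label == anchor_label]
        for i in range(count):
            if anchors:
                a = anchors[i % len(anchors)]  # cycle through available anchors
                layout.append(Box(label, a.x + dx * a.w, a.y, 0.05, a.h))
            else:
                # Assumed fallback: line up unanchored objects along the top edge.
                layout.append(Box(label, 0.05 + 0.1 * i, 0.05, 0.05, 0.05))
    return layout


requested = ["plate", "plate", "fork", "fork", "knife", "glass"]
# Stand-in for the LLM stage, which is asked to place only the core objects.
core = [Box("plate", 0.3, 0.5, 0.2, 0.2), Box("plate", 0.7, 0.5, 0.2, 0.2)]
full = complete_layout(requested, core)

print(object_recall(requested, core), object_recall(requested, full))
```

Recall here counts instances rather than distinct labels, so a request for two forks is only satisfied by two fork boxes. The paper's exact recall definition may differ.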
