Self-correcting LLM-controlled Diffusion Models

Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell

TL;DR

Diffusion-based text-to-image models often misinterpret complex prompts. SLD introduces a closed-loop system with an LLM-driven detector and an LLM controller to iteratively correct generated images without retraining, and it is compatible with API-backed models like DALL-E 3. The approach enables both generation and fine-grained object-level editing via latent-space operations, significantly improving numeracy, attribute binding, and spatial reasoning. Empirical results demonstrate strong improvement across generation and editing tasks, highlighting practical impact for accurate prompt-to-image synthesis and editing workflows.

Abstract

Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Steered by an LLM controller, SLD turns text-to-image generation into an iterative closed-loop process, ensuring correctness in the resulting image. SLD is not only training-free but can also be seamlessly integrated with diffusion models behind API access, such as DALL-E 3, to further boost the performance of state-of-the-art diffusion models. Experimental results show that our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships. Furthermore, by simply adjusting the instructions to the LLM, SLD can perform image editing tasks, bridging the gap between text-to-image generation and image editing pipelines. We will make our code available for future research and applications.
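The closed-loop process described in the abstract (generate, assess alignment, correct, repeat) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the four callables (`generate`, `detect`, `propose`, `apply_ops`), the `max_rounds` cap, and the stopping rule (stop when the controller requests no change) are all assumptions introduced here, and the toy stand-ins treat an "image" as nothing more than a list of boxes.

```python
from typing import Any, Callable, List, Tuple

# A box is (label, x, y, width, height) in normalized image coordinates.
Box = Tuple[str, float, float, float, float]


def self_correct(
    prompt: str,
    generate: Callable[[str], Any],
    detect: Callable[[Any, str], List[Box]],
    propose: Callable[[str, List[Box]], List[Box]],
    apply_ops: Callable[[Any, List[Box], List[Box]], Any],
    max_rounds: int = 3,
) -> Tuple[Any, int]:
    """Generate once, then repeatedly detect, compare, and correct.

    All four callables are hypothetical stand-ins:
      generate(prompt)                  -> initial image
      detect(image, prompt)             -> boxes currently in the image
      propose(prompt, boxes)            -> boxes the controller wants
      apply_ops(image, current, target) -> corrected image
    """
    image = generate(prompt)
    rounds_used = 0
    for _ in range(max_rounds):
        current = detect(image, prompt)
        target = propose(prompt, current)
        if sorted(target) == sorted(current):
            break  # the controller requests no change, so stop early
        image = apply_ops(image, current, target)
        rounds_used += 1
    return image, rounds_used


# Toy stand-ins: the "image" is simply the list of boxes it contains.
CAT_LEFT: Box = ("cat", 0.1, 0.1, 0.2, 0.2)
CAT_RIGHT: Box = ("cat", 0.6, 0.1, 0.2, 0.2)


def toy_generate(prompt: str) -> List[Box]:
    return [CAT_LEFT]  # the generator under-counts: one cat instead of two


def toy_detect(image: List[Box], prompt: str) -> List[Box]:
    return list(image)


def toy_propose(prompt: str, boxes: List[Box]) -> List[Box]:
    return [CAT_LEFT, CAT_RIGHT]  # the controller always asks for two cats


def toy_apply(image: List[Box], current: List[Box], target: List[Box]) -> List[Box]:
    return list(target)


final_image, rounds = self_correct(
    "two cats", toy_generate, toy_detect, toy_propose, toy_apply
)
```

In this toy run the first round adds the missing cat, and the second round finds the detected boxes already match the proposal, so the loop exits before reaching `max_rounds`.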

Paper Structure (18 sections, 10 figures, 7 tables, 1 algorithm)

Figures (10)

  • Figure 1: Existing diffusion-based text-to-image generators (e.g., DALL-E 3) generally struggle to precisely generate images that correctly align with complex input prompts, especially for the ones that require numeracy and spatial relationships. Our Self-correcting LLM-controlled Diffusion (SLD) framework empowers these diffusion models to automatically and iteratively rectify inaccuracies through applying a set of latent space operations (addition, deletion, repositioning, etc.), resulting in enhanced text-to-image alignment.
  • Figure 2: Our proposed Self-correcting LLM-controlled Diffusion (SLD) enhances text-to-image alignment through an iterative self-correction process. It begins with LLM-driven object detection and subsequently performs LLM-controlled analysis and correction. The entire pipeline is outlined in Algorithm 1.
  • Figure 3: Our self-correction pipeline is driven by two distinct LLMs: (a) The LLM parser analyzes user prompts $P$ to extract a list of key object information $S$, which is then passed to the open-vocabulary detector. (b) The LLM controller, taking both the user prompt $P$ and currently detected bounding boxes $B_{curr}$ as input, outputs suggested new bounding boxes $B_{next}$. These are subsequently transformed into a set of latent space operations $Ops$ for image manipulation.
  • Figure 4: Our latent operations can be summarized into two key concepts: (1) latents in removed regions are re-initialized to Gaussian noise, and latents of newly added or modified objects are composited onto the canvas. (2) Latent composition is confined to the initial steps, followed by "unfrozen" steps for a standard forward diffusion process, enhancing visual quality and avoiding artificial copy-and-paste effects.
  • Figure 5: SLD enhances text-to-image alignment across diverse diffusion-based generative models such as SDXL, LMD+, and DALL-E 3. Notably, as highlighted by the red boxes in the first row, SLD precisely positions a blue bicycle in relation to a bench and a palm tree, while maintaining the accurate count of palm trees and seagulls. The second row further demonstrates SLD's robustness in complex, cluttered scenes, effectively managing object collision through our training-free latent operations.
  • ...and 5 more figures
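
The Figure 3 caption describes turning the currently detected boxes $B_{curr}$ and the controller's suggested boxes $B_{next}$ into a set of operations $Ops$. One way this comparison might look is sketched below. It is an illustration under stated assumptions, not the paper's algorithm: the per-object identifiers (such as `"cat #1"`), the `(x, y, width, height)` box layout, the tolerance `tol`, and the function name `boxes_to_ops` are all hypothetical, and only three of the operation types named in Figure 1 (addition, deletion, repositioning) are covered.

```python
from typing import Dict, List, Optional, Tuple

# A box is (x, y, width, height) in normalized image coordinates.
Box = Tuple[float, float, float, float]
# An operation is (kind, object_id, old_box, new_box); unused boxes are None.
Op = Tuple[str, str, Optional[Box], Optional[Box]]


def boxes_to_ops(
    curr: Dict[str, Box], nxt: Dict[str, Box], tol: float = 1e-6
) -> List[Op]:
    """Diff two box sets keyed by a hypothetical per-object identifier."""
    ops: List[Op] = []
    # Objects present now but absent from the suggestion are deleted.
    for obj_id in sorted(curr.keys() - nxt.keys()):
        ops.append(("delete", obj_id, curr[obj_id], None))
    # Objects that only appear in the suggestion are added.
    for obj_id in sorted(nxt.keys() - curr.keys()):
        ops.append(("add", obj_id, None, nxt[obj_id]))
    # Objects in both sets are repositioned only if their box changed.
    for obj_id in sorted(curr.keys() & nxt.keys()):
        old, new = curr[obj_id], nxt[obj_id]
        if any(abs(a - b) > tol for a, b in zip(old, new)):
            ops.append(("reposition", obj_id, old, new))
    return ops


b_curr = {
    "cat #1": (0.1, 0.5, 0.2, 0.2),
    "dog #1": (0.6, 0.5, 0.3, 0.3),
}
b_next = {
    "cat #1": (0.4, 0.5, 0.2, 0.2),  # same object, moved to the right
    "cat #2": (0.7, 0.1, 0.2, 0.2),  # newly requested object
}
ops = boxes_to_ops(b_curr, b_next)
```

Here `ops` contains one deletion (`"dog #1"`), one addition (`"cat #2"`), and one repositioning (`"cat #1"`). In a full system each operation would then be realized in latent space, as the Figure 4 caption outlines.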