Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation

Young Kyung Kim, Oded Schlesinger, Yuzhou Zhao, J. Matias Di Martino, Guillermo Sapiro

TL;DR

This work tackles the opacity of text-to-image generation by introducing Chain-of-Image Generation (CoIG), a two-stage framework: a Compositional Strategy Planner (CSP), in which an LLM decomposes a complex prompt into a sequence of semantically focused sub-prompts, and an Autoregressive Refinement Model (ARM), which iteratively generates and edits the image. By enforcing a human-readable, stepwise process and a compositional lock, CoIG improves monitorability, as measured by two metrics (CoIG Readability and Causal Relevance), and mitigates entity collapse through sequential task decomposition. The approach is model-agnostic and shows competitive compositionality on standard benchmarks, along with significant improvements on the Entity Collapse Benchmark. Overall, CoIG advances trustworthy image synthesis by making the generative process observable, controllable, and causally linked to the final output, with practical implications for safety and accountability in AI systems.

Abstract

While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a "black box." This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Similar to the advantages in monitorability and performance that Chain-of-Thought (CoT) brought to large language models (LLMs), CoIG can produce equivalent benefits in text-to-image generation. CoIG utilizes an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output; and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to established baseline models. The framework is model-agnostic and can be integrated with any image generation model.

Paper Structure

This paper contains 24 sections, 2 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Proposed Chain-of-Image Generation (CoIG) framework. CoIG consists of two principal stages. (1) The Compositional Strategy Planner (CSP), in which an LLM decomposes a single complex prompt into a sequence of simpler sub-prompts $\{P_1, \dots, P_n\}$ that serves as a human-readable and monitorable plan for iterative editing. (2) The Autoregressive Refinement Model (ARM), which executes this plan by first generating an initial image with a text-to-image (T2I) model, $I_1 = \text{T2I}_{\text{generate}}(P_1)$, and then iteratively refining the image via $I_t = \text{T2I}_{\text{edit}}(I_{t-1}, P_t)$, treating the previous image $I_{t-1}$ as an explicit state that conditions subsequent generations.
  • Figure 2: Qualitative comparison between our CoIG framework and four baselines: Nano Banana \cite{fortin2025introducing}, RPG \cite{yang2024mastering}, LMD \cite{lian2023llm}, and Promptist \cite{hao2023optimizing}. Across four complex scenarios that require spatial reasoning and multi-entity coordination (1--4), the baselines frequently struggle with entity collapse and incorrect spatial arrangements. In contrast, CoIG decomposes the prompt into explicit, monitorable steps (shown at the bottom of each panel), ensuring precise object placement and correct attribute binding. For additional examples, please see the supplementary material in Appendix \ref{app:add_exp}.
  • Figure 3: Qualitative demonstration of Causal Relevance. Rows 1 and 3 show the final images from two original CoIG sequences. Rows 2 and 4 show final images resulting from single perturbations applied during intermediate steps, which change the color of the apple (row 2) and the banana (row 4). The persistence of these targeted changes confirms the causal link between intermediate steps and the final output.
  • Figure 4: Leveraging monitorability for error correction. CoIG’s stepwise structure exposes intermediate failures, allowing a monitor to intervene. Top: A counting error (five trucks instead of four) is detected in the layout phase and corrected via a targeted "Delete One Truck" instruction. Bottom: A failure to faithfully fill the placeholder for the pine trees is identified, triggering a refinement step that correctly renders the "purple pine trees."
  • Figure 5: Visualizing the monitorability advantage of CoIG. In contrast to the black-box baseline, which suffers from attribute leakage (e.g., bleeding "green" onto the cup in row 1) or ignores texture constraints (e.g., missing the "fluffy" texture in row 3), CoIG's stepwise process ensures precise attribute binding. By isolating semantic components into readable steps, our framework provides direct control for correcting these failures. More examples demonstrating readability are shown in Appendix \ref{app:readbility_more}.
  • ...and 7 more figures
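
The two-stage loop described in Figure 1, together with the monitor intervention described in Figure 4, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_coig`, `toy_generate`, `toy_edit`, and the `monitor` callback are hypothetical names, images are represented by plain strings, and the sub-prompts are written by hand rather than produced by an LLM planner.

```python
from typing import Callable, List, Optional

# Assumption: an "image" is a plain string so the sketch runs without any model.
Image = str


def run_coig(
    sub_prompts: List[str],
    generate: Callable[[str], Image],
    edit: Callable[[Image, str], Image],
    monitor: Optional[Callable[[Image, str], Optional[str]]] = None,
) -> List[Image]:
    """Execute a plan P_1..P_n and return every intermediate image I_1..I_n.

    I_1 = generate(P_1); I_t = edit(I_{t-1}, P_t) for t > 1.
    If a monitor is given, it inspects each step and may return a corrective
    instruction, which is applied as one extra edit before moving on.
    """
    if not sub_prompts:
        raise ValueError("the plan must contain at least one sub-prompt")

    images: List[Image] = []
    for t, prompt in enumerate(sub_prompts):
        # The first step generates from text only; later steps condition
        # on the previous image as explicit state.
        image = generate(prompt) if t == 0 else edit(images[-1], prompt)

        if monitor is not None:
            correction = monitor(image, prompt)
            if correction is not None:
                image = edit(image, correction)

        images.append(image)
    return images


# Hypothetical stand-ins for the T2I generate and edit calls.
def toy_generate(prompt: str) -> Image:
    return f"[{prompt}]"


def toy_edit(image: Image, prompt: str) -> Image:
    return f"{image} -> [{prompt}]"


# Hand-written plan standing in for the planner's output.
plan = ["a wooden table", "add a red apple", "add a yellow banana"]
trace = run_coig(plan, toy_generate, toy_edit)
```

Returning the full list of intermediate images, rather than only the final one, is what makes each step inspectable: a human or automated monitor can examine `trace[t]` against `plan[t]`, and the optional `monitor` callback shows where a targeted correction would be injected.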