Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation
Young Kyung Kim, Oded Schlesinger, Yuzhou Zhao, J. Matias Di Martino, Guillermo Sapiro
TL;DR
This work tackles the opacity of text-to-image generation by introducing Chain-of-Image Generation (CoIG), a two-stage framework: an LLM decomposes a complex prompt into a sequence of semantically focused sub-prompts (CSP), and an Autoregressive Refinement Model (ARM) iteratively generates and edits the image. By enforcing a human-readable, stepwise process and a compositional lock, CoIG improves monitorability, as measured by two metrics (CoIG Readability and Causal Relevance), and mitigates entity collapse through sequential task decomposition. The approach is model-agnostic; it achieves competitive compositionality on standard benchmarks and significant improvements on the Entity Collapse Benchmark. Overall, CoIG advances trustworthy image synthesis by making the generative process observable and controllable, with each step causally traceable to the final output, which has practical implications for safety and accountability in AI systems.
Abstract
While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a "black box." This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Just as Chain-of-Thought (CoT) improved monitorability and performance in large language models (LLMs), CoIG can bring comparable benefits to text-to-image generation. CoIG uses an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output, and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially improves monitorability on these metrics while remaining competitive with established baseline models in compositional robustness. The framework is model-agnostic and can be integrated with any image generation model.
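The decompose-then-refine loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Canvas`, `decompose`, `apply_step`, and `chain_of_image` are hypothetical names, the "LLM" is replaced by a naive string splitter, and the "image" is just a list of entity descriptions. The sketch only aims to show how one-entity-per-step generation exposes an inspectable intermediate state after every step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Canvas:
    """Toy stand-in for an image: the ordered entities drawn so far."""
    entities: List[str] = field(default_factory=list)


def decompose(prompt: str) -> List[str]:
    """Hypothetical stand-in for the LLM planner.

    Splits the prompt on commas and ' and ' so that each resulting
    instruction mentions a single entity.
    """
    parts = [p.strip() for p in prompt.replace(" and ", ",").split(",")]
    return [f"add {p}" for p in parts if p]


def apply_step(canvas: Optional[Canvas], step: str) -> Canvas:
    """Hypothetical stand-in for the image model.

    Generates a fresh canvas on the first step and edits the previous
    one afterwards; earlier entities are carried over unchanged.
    """
    entity = step[len("add "):]
    previous = canvas.entities if canvas is not None else []
    return Canvas(entities=previous + [entity])


# A monitor sees (step index, instruction, intermediate canvas) and
# returns False to halt generation after that step.
Monitor = Callable[[int, str, Canvas], bool]


def chain_of_image(prompt: str, monitor: Optional[Monitor] = None) -> List[Canvas]:
    """Run the plan step by step, returning every intermediate canvas."""
    canvas: Optional[Canvas] = None
    history: List[Canvas] = []
    for index, step in enumerate(decompose(prompt)):
        canvas = apply_step(canvas, step)
        history.append(canvas)
        if monitor is not None and not monitor(index, step, canvas):
            break  # an observer rejected this step; stop early
    return history
```

Calling `chain_of_image` with a prompt such as "a red cube, a blue sphere and a green cone" returns one canvas per step, and passing a `monitor` callback lets an observer inspect each intermediate result and stop the process. In a real system the two stand-in functions would be replaced by an actual LLM call and an actual generation or editing model.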
