Interleaving Reasoning for Better Text-to-Image Generation

Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin

TL;DR

IRG introduces a two-turn interleaving reasoning paradigm for text-to-image generation: a text-based thinking phase guides an initial image synthesis, and a reflection-based refinement step then improves visual fidelity. Built on a unified transformer, IRG uses the two-stage IRGL training regime and the IRGL-300K dataset to learn initial thinking, reflective refinement, and full thinking-image trajectories. Across GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, IRG reports state-of-the-art results, with notable improvements in fine-grained details, texture, and aesthetics. The approach demonstrates the value of structured cross-modal reasoning for instruction following and detail preservation in T2I generation; the authors plan to release data and code publicly.

Abstract

Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at https://github.com/Osilly/Interleaving-Reasoning-Generation.
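The two-turn flow described in the abstract (think, generate, reflect, refine) can be sketched as plain control flow. This is a minimal illustration only, not the authors' implementation: `run_irg`, `IRGTrajectory`, and the four `toy_*` callables are hypothetical names introduced here, and the toy functions stand in for calls to a unified model that would produce real text and images.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class IRGTrajectory:
    """One full thinking-image trajectory (hypothetical container)."""
    prompt: str
    initial_thinking: str
    initial_image: Any
    reflection: str
    refined_image: Any


def run_irg(
    prompt: str,
    think: Callable[[str], str],
    generate: Callable[[str, str], Any],
    reflect: Callable[[str, str, Any], str],
    refine: Callable[[str, str, Any, str], Any],
) -> IRGTrajectory:
    """Run the two interleaved turns in order.

    Turn 1: text-based thinking, then an initial image conditioned on it.
    Turn 2: textual reflection on that image, then a refined image.
    """
    thinking = think(prompt)
    image = generate(prompt, thinking)
    reflection = reflect(prompt, thinking, image)
    refined = refine(prompt, thinking, image, reflection)
    return IRGTrajectory(prompt, thinking, image, reflection, refined)


# Toy stand-ins (assumptions for illustration; a real system would call a model).
def toy_think(prompt: str) -> str:
    return f"plan for: {prompt}"


def toy_generate(prompt: str, thinking: str) -> dict:
    return {"version": 1, "conditioned_on": [prompt, thinking]}


def toy_reflect(prompt: str, thinking: str, image: dict) -> str:
    return f"improve details of image v{image['version']}"


def toy_refine(prompt: str, thinking: str, image: dict, reflection: str) -> dict:
    return {
        "version": image["version"] + 1,
        "conditioned_on": [prompt, thinking, reflection],
    }
```

Calling `run_irg("a red cube", toy_think, toy_generate, toy_reflect, toy_refine)` returns a trajectory holding both images and both text segments. Passing the stages as callables keeps the ordering explicit while leaving the model itself unspecified.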

Paper Structure

This paper contains 29 sections, 3 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Panel (a) illustrates an example of Interleaving Reasoning Generation (IRG). Given a prompt, the model first produces a text-based reasoning process and then generates an image conditioned on that reasoning. Next, building upon the initial image, the model reflects on how to improve its quality and produces a refined image through this reflection process. IRG can substantially enhance image generation quality. For instance, in the top case of (a), IRG improves upon the previously generated image via multi-turn reasoning, enhancing rendering textures, shadow realism, and other visual properties. In the bottom case of (a), IRG significantly improves fine-grained details, such as the delicate structures of fingers, highlighted within the red box and detailed in (b). As shown in (c), compared to current SoTA models, our proposed IRG achieves clearly superior performance across multiple mainstream T2I benchmarks.
  • Figure 2: Visualization results of IRG at 1024×1024 resolution. The examples are selected from WISE [niu2025wise], TIIF [wei2025tiifbenchdoest2imodel], and GenAI-Bench [li2024genaibenchevaluatingimprovingcompositional].
  • Figure 3: Overview of our proposed IRG training and inference pipeline. IRG learns the text-based thinking process and the complete high-quality image generation pipeline under six decomposed learning modes. During inference, we introduce a dedicated CFG condition design [ho2022classifier] for IRG’s improved image generation steps.
  • Figure 4: Visualization comparison results of BAGEL [deng2025emerging], BAGEL w/ self-CoT [deng2025emerging], IRG reasoning step 1 and our proposed IRG at 1024×1024 resolution. The examples are selected from WISE [niu2025wise] and GenAI-Bench [li2024genaibenchevaluatingimprovingcompositional]. Red boxes highlight the fine-grained details that have obvious flaws.
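For context on the CFG mentioned in the Figure 3 caption, the sketch below shows the commonly used classifier-free guidance combination step. It is the generic formulation only: the caption does not specify IRG's dedicated condition design, so nothing here should be read as that design. The function name `cfg_combine` and the use of plain float lists are assumptions made for illustration.

```python
from typing import List


def cfg_combine(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Generic classifier-free guidance combination.

    Computes uncond + scale * (cond - uncond) element-wise, where `cond` and
    `uncond` are the model predictions with and without the conditioning.
    scale = 1.0 returns the conditional prediction unchanged; larger values
    push the result further toward the condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

For example, `cfg_combine([1.0, 2.0], [0.0, 0.0], 3.0)` scales the difference between the two predictions by three. In a multi-condition setting such as IRG's refinement step, which inputs count as the "unconditional" branch is a design choice that this sketch leaves open.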