Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng

TL;DR

TwiG addresses the challenge of long-horizon coherence and nuanced instruction adherence in visual generation by interleaving textual reasoning within the generation loop. It introduces a three-component framework (When to Think, What to Say, and How to Refine), implemented with a single unified understanding-generation model and explored through three strategies: zero-shot prompting, supervised fine-tuning on TwiG-50K, and reinforcement learning with TwiG-GRPO. The results show strong zero-shot gains, modest but robust improvements with SFT, and substantial benefits from RL with a reward ensemble, demonstrating that on-the-fly multimodal reasoning is feasible and can improve synthesis. The work opens a new direction for controllable, context-aware generation and offers datasets, prompts, and RL strategies that future multimodal generative systems can build on.

Abstract

Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning (i.e., thinking) either before the generation process (as pre-planning) or after it (as post-refinement), yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Interleaving Textual Reasoning throughout Visual Generation. Inspired by the image-interleaved reasoning in textual responses [su2025thinking, zheng2025deepeyes, openai_o3, chen2025mint], we reverse the modality flow and weave textual thoughts into the unfolding canvas, delivering on-the-fly guidance and reflection throughout synthesis.
  • Figure 2: Comparison of Where the Textual Reasoning is Applied in Visual Generation: (a) Think-before-Generation [jiang2025t2i, fang2025got, liao2025imagegen] injects a pre-planning thought prior to the synthesis, limiting fine-grained control and later correction; (b) Think-after-Generation [guo2025can, qin2025uni, li2025reflect] verifies and revises the image once it is complete, lacking nuanced, timely adjustment and incurring extra inference cost; (c) Our Thinking-while-Generating interleaves thoughts and reflections throughout the synthesis, providing on-the-fly, co-evolving guidance.
  • Figure 3: Overall Pipeline of Thinking-while-Generating. The framework comprises three components: When to Think for globally determining the interleaved generation schedule; What to Say for producing the step-by-step textual thought as fine-grained guidance; and How to Refine for a region-level reflection on the current canvas with optional corrective updates. $\mathrm{ULM}_{u}$ and $\mathrm{ULM}_{g}$ denote the same single ULM applied for understanding and generation, respectively.
  • Figure 4: Illustration of Interleaved Token Sequence: (a) In What to Say, the textual pre-context extends from $\{\tau_j\}_{j<k}$ to $\{\tau_j\}_{j\le k}$ ($K = 2$), guiding the generation of the next $\mathcal{V}_k$ while leaving the earlier $\{\mathcal{V}_j\}_{j<k}$ untouched; (b) In How to Refine, the thought ${\tau}_{k}$ is revised to $\hat{\tau}_{k}$, and only the local region $\hat{\mathcal{V}}_k$ is re-generated to replace $\mathcal{V}_k$. Neither operation requires the ULM to possess image-to-image capabilities, and both preserve a single text-to-image generation trajectory without launching a fresh pass or full re-generation.
  • Figure 5: Qualitative Comparison of TwiG Variants: the baseline (Janus-Pro-7B [chen2025janus]), TwiG-ZS, -SFT, and -RL. Our method demonstrates progressive improvements in compositional fidelity, object counting, and visual realism.
  • ...and 2 more figures
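
The control flow described in the captions of Figures 3 and 4 can be sketched as a toy loop. This is only an illustration under stated assumptions, not the authors' implementation: the names `twig_loop`, `Canvas`, `plan`, `think`, `generate`, `reflect`, and `revise` are all hypothetical, strings stand in for visual tokens, the stub functions replace the ULM calls, and the single refinement attempt per region is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Canvas:
    """Interleaved state: thoughts tau_1..tau_k and regions V_1..V_k."""
    thoughts: List[str] = field(default_factory=list)
    regions: List[str] = field(default_factory=list)  # strings stand in for visual tokens


def twig_loop(
    prompt: str,
    plan: Callable[[str], List[str]],
    think: Callable[[str, str, Canvas], str],
    generate: Callable[[List[str], List[str]], str],
    reflect: Callable[[str, Canvas], bool],
    revise: Callable[[str, Canvas], str],
) -> Canvas:
    """Hypothetical sketch of the three components; every callable is a stub."""
    canvas = Canvas()
    schedule = plan(prompt)  # When to Think: fix the interleaving schedule up front
    for k, region_name in enumerate(schedule):
        # What to Say: extend the textual pre-context from j < k to j <= k.
        canvas.thoughts.append(think(prompt, region_name, canvas))
        # Generate only the next region; earlier regions stay untouched.
        canvas.regions.append(generate(canvas.thoughts, canvas.regions))
        # How to Refine: region-level reflection with an optional local update.
        if not reflect(prompt, canvas):
            canvas.thoughts[k] = revise(prompt, canvas)
            # Re-generate only region k, conditioned on the regions before it.
            canvas.regions[k] = generate(canvas.thoughts, canvas.regions[:k])
    return canvas


# --- Toy stubs (assumptions for demonstration only) ---
def plan(prompt: str) -> List[str]:
    return ["upper", "lower"]  # two regions


def think(prompt: str, region_name: str, canvas: Canvas) -> str:
    return f"{region_name}: {prompt}"


def generate(thoughts: List[str], previous_regions: List[str]) -> str:
    return f"V[{thoughts[-1]}]"


def reflect(prompt: str, canvas: Canvas) -> bool:
    # Pretend the critic rejects the second region until its thought is revised.
    k = len(canvas.regions) - 1
    return k != 1 or canvas.thoughts[k].endswith("(revised)")


def revise(prompt: str, canvas: Canvas) -> str:
    return canvas.thoughts[-1] + " (revised)"


result = twig_loop("a cat", plan, think, generate, reflect, revise)
print(result.thoughts)
print(result.regions)
```

In this toy run the first region is accepted as generated, while the second is rejected once, its thought is revised, and only that region is replaced. This mirrors the caption's point that refinement stays within a single generation trajectory rather than launching a full re-generation.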