DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li

TL;DR

This work tackles planning and rare-attribute generation in text-to-image synthesis by introducing DraCo, an interleaved Draft-as-CoT framework that first produces a low-resolution visual draft for concrete planning, then verifies the draft against the input prompt and refines the image through selective corrections. It pairs the framework with a specialized training dataset, DraCo-240K, which teaches atomic correction capabilities, and with DraCo-CFG, a guidance strategy that balances visual semantics and textual corrections. Empirical results on GenEval, ImagineBench, and GenEval++ show that DraCo substantially outperforms direct generation and text-only CoT baselines and handles rare attribute combinations better. Overall, DraCo demonstrates the value of integrating visual drafts into multimodal reasoning within unified MLLMs, enabling more robust and controllable text-to-image generation.

Abstract

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
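
The three stages described in the abstract (draft, verify, refine) can be pictured as a small control flow. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `draft_as_cot`, `Verification`, `sketch`, `verify`, and `refine` are hypothetical, images are stood in for by plain strings, and the toy stand-ins at the bottom merely mimic a draft that loses a rare attribute. In the paper all three roles are played by one unified MLLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Image = str  # placeholder type; a real system would pass image tensors or tokens


@dataclass
class Verification:
    """Hypothetical container for the text-reasoning step's verdict."""
    needs_correction: bool
    suggestion: Optional[str] = None


def draft_as_cot(
    prompt: str,
    sketch: Callable[[str], Image],
    verify: Callable[[str, Image], Verification],
    refine: Callable[[str, Image, Optional[str]], Image],
) -> Tuple[Image, Image, Verification]:
    """Run the draft -> verify -> refine loop once and return all artifacts."""
    draft = sketch(prompt)           # 1. low-resolution visual draft
    verdict = verify(prompt, draft)  # 2. check the draft against the prompt
    # 3. refine: apply the correction only when the draft was judged misaligned
    correction = verdict.suggestion if verdict.needs_correction else None
    final = refine(prompt, draft, correction)
    return final, draft, verdict


# Toy stand-ins (assumptions for illustration only).
def toy_sketch(prompt: str) -> Image:
    # Pretend the generator falls back to the common attribute.
    return f"lowres[{prompt.replace('purple', 'yellow')}]"


def toy_verify(prompt: str, draft: Image) -> Verification:
    # Flag prompt words that are absent from the draft description.
    missing = [word for word in prompt.split() if word not in draft]
    if missing:
        return Verification(True, "add missing: " + ", ".join(missing))
    return Verification(False)


def toy_refine(prompt: str, draft: Image, correction: Optional[str]) -> Image:
    if correction is None:
        # Nothing to fix: only upscale the draft.
        return draft.replace("lowres", "highres", 1)
    # Otherwise regenerate the misaligned content while upscaling.
    return f"highres[{prompt}]"
```

Calling `draft_as_cot("a purple banana", toy_sketch, toy_verify, toy_refine)` with these stand-ins exercises the correction branch, while a prompt the toy sketcher renders faithfully only goes through upscaling.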

Paper Structure

This paper contains 32 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Conceptual Comparison of CoT Reasoning for T2I Generation. (a) Generation without reasoning. (b) Employing external reward models to guide generation. (c) Generating a text CoT before producing the image. (d) DraCo: producing a visual draft for detailed planning, verifying it with text reasoning, then correcting and refining the draft for the final output.
  • Figure 2: Visualization of DraCo Output. For each example, the larger image represents the final output, while the smaller image is the visual draft. The corresponding prompt is located in the corner of each set.
  • Figure 3: Framework of DraCo. DraCo contains three steps for generation: draft sketching, draft verification, and corrective refinement.
  • Figure 4: Construction Pipeline and Examples of DraCo-240K. We design specialized data pipelines for each of the three atomic correction capabilities: general correction, instance manipulation, and layout reorganization. We then employ Qwen3-VL to generate prompts and verifications based on the collected image pairs. Finally, we organize the data into two categories for training: corrections needed and corrections not needed.
  • Figure 5: Detailed Visualization of DraCo Output. We showcase the prompt, verification, draft (smaller image), and final output (larger image). DraCo successfully identifies the misalignment within the draft and conducts the correction based on the suggested modification.
  • ...and 3 more figures