ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Jiaqi Liao, Zhengyuan Yang, Linjie Li, Dianqi Li, Kevin Lin, Yu Cheng, Lijuan Wang

TL;DR

This work tackles Text-to-Image In-Context Learning (T2I-ICL) by introducing ImageGen-CoT, a chain-of-thought reasoning step that a unified multimodal LLM (MLLM) produces before it generates an image. An automated pipeline constructs an ImageGen-CoT dataset, which is then used to fine-tune unified MLLMs. The method combines two-stage inference, the automated dataset-construction pipeline, and test-time Best-of-N and hybrid scaling to improve T2I-ICL performance on the CoBSAT and DreamBench++ benchmarks. Prompting with ImageGen-CoT already yields substantial gains, fine-tuning on the ImageGen-CoT dataset improves results further, and hybrid scaling adds another boost, with practical impact on in-context learning and personalization tasks.

Abstract

In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. To avoid generating unstructured, ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80% performance gain for SEED-X on T2I-ICL tasks. See our project page at https://ImageGen-CoT.github.io/. Code and model weights will be open-sourced.
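
The hybrid scaling idea in the abstract (sample several ImageGen-CoT chains, then sample several images per chain) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `sample_candidates`, `best_of_n`, the `Candidate` record, and the `toy_*` stand-ins are hypothetical and are not the authors' code or API. The sketch also assumes that the final output is chosen Best-of-N style by some scoring function, which the source does not specify in detail.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    chain: str    # sampled ImageGen-CoT text
    image: str    # placeholder for a generated image
    score: float  # score from a hypothetical selector (higher is better)


def sample_candidates(
    prompt: str,
    generate_chain: Callable[[str, random.Random], str],
    generate_image: Callable[[str, str, random.Random], str],
    score: Callable[[str, str], float],
    num_chains: int,
    images_per_chain: int,
    seed: int = 0,
) -> List[Candidate]:
    """Sample num_chains reasoning chains, then images_per_chain images per chain."""
    rng = random.Random(seed)
    candidates: List[Candidate] = []
    for _ in range(num_chains):
        chain = generate_chain(prompt, rng)  # stage 1: reasoning text
        for _ in range(images_per_chain):
            image = generate_image(prompt, chain, rng)  # stage 2: image
            candidates.append(Candidate(chain, image, score(prompt, image)))
    return candidates


def best_of_n(candidates: List[Candidate]) -> Candidate:
    """Return the highest-scoring candidate (Best-of-N selection)."""
    return max(candidates, key=lambda c: c.score)


# Toy stand-ins so the sketch runs without a model (all hypothetical).
def toy_chain(prompt: str, rng: random.Random) -> str:
    return f"chain-{rng.randint(0, 999)}"


def toy_image(prompt: str, chain: str, rng: random.Random) -> str:
    return f"image-{rng.randint(0, 999)}"


def toy_score(prompt: str, image: str) -> float:
    return float(image.split("-")[1])
```

Under this reading, setting `images_per_chain=1` samples many chains with one image each, and setting `num_chains=1` samples many images from a single chain; hybrid scaling uses values greater than one for both. How the sampling budget is actually split between chains and images is not stated in this excerpt.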

Paper Structure

This paper contains 24 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Performance comparison on CoBSAT and DreamBench++ benchmarks. Our method significantly improves SEED-X's performance through progressive enhancements: adding ImageGen-CoT, fine-tuning with the ImageGen-CoT dataset, and applying test-time scaling strategies.
  • Figure 2: Main Pipeline. (a) Data Collection Pipeline: An automated iterative process where the MLLM acts as Generator, Selector, Critic, and Refiner to produce high-quality ImageGen-CoT (reasoning chains) and aligned images. (b) Training Pipeline: Fine-tuning unified MLLMs on the collected ImageGen-CoT dataset to enhance contextual reasoning and image generation. (c) Test-Time Scaling: Strategies for performance improvement via hybrid scaling during inference.
  • Figure 3: Test-time scaling strategies comparison. We conducted a comprehensive evaluation of three distinct scaling strategies: Multi-Chain Scaling, Single-Chain Scaling, and Hybrid Scaling, examining their performance across varying numbers of generated outputs (N=2,4,8,16). The experimental results are presented in two panels, with the left panel showing results on CoBSAT and the right panel showing results on DreamBench++. The red numbers indicate the performance improvements achieved by Hybrid Scaling compared to Single-Chain Scaling.
  • Figure 4: Qualitative Results. Comparison of generation results on CoBSAT (top) and DreamBench++ (bottom) using baseline SEED-X, SEED-X with ImageGen-CoT prompting, and SEED-X fine-tuned with the ImageGen-CoT dataset.
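
The data collection loop in the Figure 2(a) caption names four roles for the MLLM: Generator, Selector, Critic, and Refiner. One possible reading of that loop is sketched below in Python. The control flow (generate candidates, select one, then alternate critique and refinement up to a round limit) is an assumption, as are all function names, parameters, and the `toy_*` stand-ins. The sketch handles only the reasoning text and omits the generation of aligned images.

```python
from typing import Callable, List, Tuple


def collect_sample(
    task: str,
    generator: Callable[[str, int], List[str]],
    selector: Callable[[str, List[str]], str],
    critic: Callable[[str, str], Tuple[bool, str]],
    refiner: Callable[[str, str, str], str],
    num_candidates: int = 3,
    max_rounds: int = 2,
) -> Tuple[str, int]:
    """Return a reasoning chain and the number of refinement steps applied."""
    candidates = generator(task, num_candidates)  # Generator: propose chains
    best = selector(task, candidates)             # Selector: keep one chain
    for round_idx in range(max_rounds):
        accepted, feedback = critic(task, best)   # Critic: accept or give feedback
        if accepted:
            return best, round_idx
        best = refiner(task, best, feedback)      # Refiner: revise the chain
    return best, max_rounds


# Toy stand-ins so the loop runs without a model (all hypothetical).
def toy_generator(task: str, n: int) -> List[str]:
    return [f"{task}: draft" + "!" * i for i in range(n)]


def toy_selector(task: str, candidates: List[str]) -> str:
    return max(candidates, key=len)


def toy_critic(task: str, chain: str) -> Tuple[bool, str]:
    accepted = "refined" in chain
    return accepted, "" if accepted else "add detail"


def toy_refiner(task: str, chain: str, feedback: str) -> str:
    return f"{chain} [refined: {feedback}]"
```

Passing the four roles as callables keeps the loop independent of any particular model; in the setting the caption describes, a single MLLM would play all four roles under different prompts.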