ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Jiaqi Liao; Zhengyuan Yang; Linjie Li; Dianqi Li; Kevin Lin; Yu Cheng; Lijuan Wang

TL;DR

This work tackles Text-to-Image In-Context Learning (T2I-ICL) by introducing ImageGen-CoT, a chain-of-thought reasoning step that precedes image generation; the reasoning traces are constructed automatically and used to fine-tune Unified Multimodal LLMs. It combines two-stage inference, an automated dataset-construction pipeline, and test-time Best-of-N and hybrid scaling to improve T2I-ICL performance on benchmarks such as CoBSAT and DreamBench++. Key findings show substantial gains from prompting alone, further improvements from fine-tuning on the ImageGen-CoT dataset, and additional boosts from hybrid scaling, pointing toward a new paradigm for multimodal reasoning and generation with practical impact on in-context learning and personalization tasks.

Abstract

In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process, called ImageGen-CoT, prior to image generation. To avoid generating unstructured, ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs on this dataset to enhance their contextual reasoning capabilities. To further improve performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach, which first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80% performance gain for SEED-X on T2I-ICL tasks. See our project page at https://ImageGen-CoT.github.io/. Code and model weights will be open-sourced.
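The hybrid scaling idea described in the abstract (sample several ImageGen-CoT chains, then sample several images per chain) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `gen_cot`, `gen_image`, and `score` are hypothetical stand-ins for an MLLM's reasoning step, its image-generation step, and a candidate scorer, and the toy versions below only produce strings. The abstract does not specify how the final image is selected, so the best-of-N choice by score is also an assumption.

```python
import random
from typing import Callable, List, Tuple

Candidate = Tuple[str, str, float]  # (cot, image, score)


def hybrid_candidates(
    prompt: str,
    gen_cot: Callable[[str, random.Random], str],
    gen_image: Callable[[str, str, random.Random], str],
    score: Callable[[str, str], float],
    n_chains: int = 3,
    n_images: int = 2,
    seed: int = 0,
) -> List[Candidate]:
    """Sample n_chains reasoning chains, then n_images images per chain."""
    rng = random.Random(seed)
    candidates: List[Candidate] = []
    for _ in range(n_chains):
        cot = gen_cot(prompt, rng)  # stage 1: reasoning chain
        for _ in range(n_images):
            image = gen_image(prompt, cot, rng)  # stage 2: image from chain
            candidates.append((cot, image, score(prompt, image)))
    return candidates


def best_of_n(candidates: List[Candidate]) -> Candidate:
    """Pick the highest-scoring candidate (assumed selection rule)."""
    return max(candidates, key=lambda c: c[2])


# Toy stand-ins (hypothetical): a real system would call an MLLM here.
def toy_cot(prompt: str, rng: random.Random) -> str:
    return f"cot-{rng.randint(0, 999)}"


def toy_image(prompt: str, cot: str, rng: random.Random) -> str:
    return f"img({cot})-{rng.randint(0, 999)}"


def toy_score(prompt: str, image: str) -> float:
    return (sum(map(ord, image)) % 100) / 100.0


if __name__ == "__main__":
    cands = hybrid_candidates("a red car", toy_cot, toy_image, toy_score)
    print(len(cands), best_of_n(cands))
```

With `n_chains=3` and `n_images=2`, the sketch produces six candidates in total; scaling only the inner loop with a single chain would instead correspond to plain best-of-N over images.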
