Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee
TL;DR
This work defines Text-to-Image In-Context Learning (T2I-ICL) and introduces CoBSAT, the first benchmark designed to test whether MLLMs can turn textual prompts into images, or into image descriptions, by following in-context demonstrations. An evaluation of MLLMs that generate images and MLLMs that output only text finds that multimodal integration and image generation are the primary bottlenecks, and that describing the target image is easier for some models than creating it. Fine-tuning on CoBSAT and Chain-of-Thought prompting substantially improve T2I-ICL performance for several models, though the gains are model-dependent and these strategies are sometimes counterproductive for other models. The results highlight the need for targeted prompt engineering, broader multimodal training, and possibly multimodal-CoT strategies to advance T2I-ICL capabilities, with practical implications for the design and evaluation of future MLLMs. CoBSAT and its associated code provide a foundation for ongoing research in this underexplored area of multimodal in-context learning.
Abstract
The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing studies have concentrated primarily on image-to-text ICL. However, Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Using our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties that MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help mitigate these difficulties, leading to notable improvements in performance. Our code and dataset are available at https://github.com/UW-Madison-Lee-Lab/CoBSAT.
