Interleaved-Modal Chain-of-Thought
Jun Gao, Yongqi Li, Ziqiang Cao, Wenjie Li
TL;DR
The paper addresses the challenge of extending chain-of-thought prompting to vision-language models by generating interleaved multimodal rationales that pair visual and textual reasoning to guide the final answer. It introduces Interleaved-modal Chain-of-Thought (ICoT) and a plug-and-play Attention-driven Selection (ADS) mechanism that uses a VLM’s own attention maps to insert $n$ fine-grained visual tokens from the input image after a signal token, without parameter updates. Empirical results on M$^3$CoT, ScienceQA, and LLaVA-W show improvements of up to 14% over strong multimodal CoT baselines, along with improved interpretability through traceable interleaved reasoning steps. The work also analyzes the KV-cache alternative and demonstrates that manually designed demonstrations can outperform automatically generated ones, while highlighting open questions about patch count and efficiency.
Abstract
Chain-of-Thought (CoT) prompting elicits a series of intermediate reasoning steps from large language models (LLMs) before they arrive at the final answer. However, when CoT is applied to vision-language models (VLMs), text-only rationales struggle to express fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, ICoT requires VLMs to generate fine-grained interleaved-modal content, which is hard for current VLMs to do. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS selects regions of the input image and inserts them into the generated sequence to form the interleaved-modal reasoning steps, with negligible additional latency. ADS relies solely on the attention maps of VLMs without any parameter updates, and therefore it is a plug-and-play strategy that generalizes to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial performance gains (up to 14%) and improved interpretability compared to existing multimodal CoT prompting methods.
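The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_patches` and `insert_after_signal` are hypothetical, the attention scores are assumed to be already reduced to one value per image patch (how layers and heads are aggregated is not specified here), and keeping the chosen patches in their original order is an assumption made for illustration.

```python
from typing import List, Sequence, Tuple


def select_patches(attention: Sequence[float], n: int) -> List[int]:
    """Return the indices of the n most-attended image patches.

    `attention` holds one score per image patch (assumed to be pre-aggregated).
    Indices are returned in original patch order (an assumption of this sketch).
    """
    n = max(0, min(n, len(attention)))
    # Rank patch indices by attention score, highest first.
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:n])


def insert_after_signal(
    sequence: Sequence[object],
    patches: Sequence[object],
    attention: Sequence[float],
    n: int,
) -> Tuple[List[object], List[int]]:
    """Append the n most-attended patches after a sequence ending in the signal token.

    `sequence` is the content generated so far, with the signal token as its last
    element; `patches` are the visual tokens of the input image. No parameters
    are trained or updated: the selection only reads the attention scores.
    """
    if len(patches) != len(attention):
        raise ValueError("expected one attention score per image patch")
    chosen = select_patches(attention, n)
    extended = list(sequence) + [patches[i] for i in chosen]
    return extended, chosen


# Hypothetical toy inputs: four patches, and a text prefix ending in a signal token.
toy_sequence = ["The", "object", "is", "<signal>"]
toy_patches = ["patch0", "patch1", "patch2", "patch3"]
toy_attention = [0.1, 0.5, 0.2, 0.9]
extended, chosen = insert_after_signal(toy_sequence, toy_patches, toy_attention, n=2)
```

In this toy run the two highest-scoring patches (indices 1 and 3) are appended after the signal token, and generation would then continue from the extended sequence. A real VLM would operate on embedding tensors rather than string placeholders.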
