Interleaved-Modal Chain-of-Thought

Jun Gao, Yongqi Li, Ziqiang Cao, Wenjie Li

TL;DR

The paper addresses the challenge of extending chain-of-thought prompting to vision-language models by generating interleaved multimodal rationales that couple paired images and textual reasoning to guide the final answer. It introduces Interleaved-modal Chain-of-Thought (ICoT) and a plug-and-play Attention-driven Selection (ADS) mechanism that uses a VLM’s own attention maps to insert $n$ fine-grained visual tokens from the input image after a signal token, without parameter updates. Empirical results on M$^3$CoT, ScienceQA, and LLaVA-W show up to 14% improvements over strong multimodal CoT baselines and improved interpretability through traceable interleaved reasoning steps. The work also analyzes the KV-cache alternative and demonstrates that manually designed demonstrations can outperform automatically generated ones, while highlighting open questions about patch count and efficiency.

Abstract

Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express the fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, ICoT requires VLMs to generate fine-grained interleaved-modal content, which is hard for current VLMs to do. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS inserts regions of the input image to generate the interleaved-modal reasoning steps with negligible additional latency. ADS relies solely on the attention map of VLMs and adds no parameters, so it is a plug-and-play strategy that can be generalized to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial performance (up to 14%) and interpretability improvements compared to existing multimodal CoT prompting methods.
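
The selection step that ADS performs can be illustrated with a minimal sketch. It assumes the attention of the signal token over the visual tokens is already available as a flat list of scores, picks the n highest-scoring tokens, and splices them into the sequence right after the signal token. The function names, the plain top-n rule, and the choice to restore the tokens' original order are illustrative assumptions, not the authors' implementation.

```python
from typing import Any, List, Sequence, Tuple


def select_visual_tokens(
    signal_attention: Sequence[float],
    visual_tokens: Sequence[Any],
    n: int,
) -> Tuple[List[int], List[Any]]:
    """Return the n visual tokens that receive the most attention.

    signal_attention[i] is the (hypothetical) attention weight the signal
    token assigns to visual token i. Selected tokens are returned in their
    original order; that ordering is an assumption of this sketch.
    """
    if len(signal_attention) != len(visual_tokens):
        raise ValueError("one attention score is required per visual token")
    if n < 0:
        raise ValueError("n must be non-negative")
    n = min(n, len(visual_tokens))
    # Rank token indices by descending attention (ties keep original order).
    ranked = sorted(range(len(signal_attention)), key=lambda i: -signal_attention[i])
    keep = sorted(ranked[:n])
    return keep, [visual_tokens[i] for i in keep]


def insert_after_signal(
    sequence: Sequence[Any], signal_position: int, selected: Sequence[Any]
) -> List[Any]:
    """Splice the selected visual tokens in right after the signal token."""
    head = list(sequence[: signal_position + 1])
    tail = list(sequence[signal_position + 1 :])
    return head + list(selected) + tail


# Toy usage: five visual tokens, keep the two most attended ones.
attention = [0.1, 0.5, 0.05, 0.3, 0.05]
tokens = [[0.0], [1.0], [2.0], [3.0], [4.0]]
indices, chosen = select_visual_tokens(attention, tokens, n=2)
print(indices)  # [1, 3]
print(insert_after_signal(["question", "<signal>", "next"], 1, chosen))
```

Because the sketch only reads attention scores and copies existing tokens, it involves no trainable parameters, which matches the plug-and-play framing above. A real VLM would aggregate attention across heads and layers in some way the abstract does not specify.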

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: Comparison between multimodal CoT with text-only rationales (Left) and interleaved-modal rationales (Right). Green blocks are correct texts used to infer the final answer. Text-only rationales restrict VLMs to a rough description of object positions. Transparent boxes mark the regions that are selected and inserted to form paired visual and textual rationales in ICoT.
  • Figure 2: The workflow of ADS selecting fine-grained visual rationales. Signal attention represents the attention map of the signal token over all visual tokens.
  • Figure 3: Case studies between ICoT and multimodal CoT with text-only rationales on Chameleon. Three cases are selected according to three typical problems in text-only rationales: misunderstanding, overgeneralization, and hallucination. Red blocks indicate the incorrect rationales.
  • Figure 4: The results of ICoT across validation sets of two datasets on both Chameleon and Qwen2-VL, with the number of selected patches set to 32, 64, 128, and 256. The reported scores are normalized for simplicity.
  • Figure 5: The case of demonstration with Fine-grained Visual Information (FVI), which is used in 1-shot ICoT.