Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts

Xu Liu, Yongheng Zhang, Qiguang Chen, Yao Li, Sheng Wang, Libo Qin

Abstract

Interleaved-modal Chain-of-Thought (ICoT) reasoning has recently achieved remarkable success by leveraging both multimodal inputs and outputs, and it is attracting increasing attention. Despite their promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, in which the inserted visual tokens are discontinuous and semantically incoherent. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration, which adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency; and (2) Precise Visual Thought Guidance, which ensures that visual representations are semantically coherent and contextually aligned. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption and enabling more efficient ICoT reasoning.
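To make the efficiency figure concrete, the following minimal sketch shows how a relative token reduction of this kind is computed. The function name and the token counts are hypothetical, chosen only so that the arithmetic reproduces a 72.6% decrease; they are not measurements from the paper.

```python
def relative_reduction(baseline_tokens: int, method_tokens: int) -> float:
    """Return the fraction of tokens saved relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - method_tokens) / baseline_tokens


# Hypothetical counts (assumption, not data from the paper):
# a baseline that uses 1000 tokens and a method that uses 274.
baseline = 1000
method = 274
print(f"{relative_reduction(baseline, method):.1%}")
```

Read this way, a 72.6% decrease means the method consumes 27.4% of the baseline's tokens on the same inputs.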

Paper Structure (16 sections, 8 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: (a) Current ICoT: While it supports multimodal inputs and outputs, it suffers from Static Visual Thought Integration, which requires inserting visual information after every step, and Broken Visual Thought Representation, in which the inserted visual tokens lack coherence, resulting in inefficient reasoning. (b) Our DaP-ICoT: It provides Dynamic Visual Thought Integration and Precise Visual Thought Guidance, enabling efficient reasoning.
  • Figure 2: An overview of Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), including Dynamic Visual Thought Integration (DVTI) and Precise Visual Thought Guidance (PVTG).
  • Figure 3: Ablation study on Qwen2-VL-7B. "w/o PVTG" indicates removal of Precise Visual Thought Guidance for Visual Cues, and "w/o DVTI" indicates removal of Dynamic Visual Thought Integration for Adaptive Reasoning.
  • Figure 4: A comparison of total token consumption between DaP-ICoT and baseline methods on the M$^3$CoT benchmark using Qwen2-VL-7B. DaP-ICoT achieves a 72.6% reduction in token consumption compared to ICoT.
  • Figure 5: A comparison of image insertion frequency and the number of inserted image tokens between DaP-ICoT and ICoT on the Qwen2-VL-7B model.
  • ...and 3 more figures