Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts

Xu Liu; Yongheng Zhang; Qiguang Chen; Yao Li; Sheng Wang; Libo Qin

Abstract

Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, and it is attracting increasing attention. Despite their promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration, which adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency; and (2) Precise Visual Thought Guidance, which ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption and enabling more efficient ICoT reasoning.
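To make the contrast between static and dynamic visual thought insertion concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how "reasoning needs" are measured; the sketch assumes a per-step confidence score compared against a threshold `tau` (the outline mentions a confidence threshold $\tau$), with a visual thought inserted only when confidence falls below it. The names `Step`, `static_plan`, and `dynamic_plan`, and all scores, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One textual reasoning step with a hypothetical confidence score in [0, 1]."""
    text: str
    confidence: float


def static_plan(steps: List[Step]) -> List[bool]:
    """Static positioning: a visual thought is inserted after every step."""
    return [True for _ in steps]


def dynamic_plan(steps: List[Step], tau: float) -> List[bool]:
    """Assumed dynamic rule: insert a visual thought only when confidence < tau."""
    return [step.confidence < tau for step in steps]


# Illustrative values only; these are not results from the paper.
steps = [
    Step("Identify the objects in the scene", 0.92),
    Step("Compare the two highlighted regions", 0.41),
    Step("State the final answer", 0.88),
]

static = static_plan(steps)
dynamic = dynamic_plan(steps, tau=0.5)

print("static insertions:", sum(static))
print("dynamic insertions:", sum(dynamic))
```

Under this assumed rule, visual tokens are spent only on the steps where the model is uncertain, which is one plausible way fewer inserted images could translate into lower token consumption.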

Paper Structure (16 sections, 8 equations, 8 figures, 1 table)

Introduction
DaP-ICoT Reasoning
Dynamic Visual Thought Integration
Precise Visual Thought Guidance
Experiments and Analysis
Experiments Setting
Main Results
Analysis
1. Both the DVTI and PVTG modules are vital for addressing key ICoT challenges.
2. DaP-ICoT significantly reduces the token consumption of MLLMs.
3. DaP-ICoT reduces resource consumption from image insertions.
4. DaP-ICoT effectively enhances confidence during the reasoning process.
5. The search strategy for the confidence threshold $\tau$.
6. Qualitative Analysis.
Related Work
...and 1 more sections

Figures (8)

Figure 1: (a) Current ICoT: While supporting multimodal inputs and outputs, it suffers from Static Visual Thought Integration, which requires the insertion of visual information after each step, and Broken Visual Thought Representation, in which the inserted visual tokens lack coherence, resulting in inefficient reasoning. (b) Our DaP-ICoT: It provides Dynamic Visual Thought Integration and Precise Visual Thought Guidance, enabling efficient reasoning.
Figure 2: An overview of Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), including Dynamic Visual Thought Integration (DVTI) and Precise Visual Thought Guidance (PVTG).
Figure 3: Ablation Study on Qwen2-VL-7B: "w/o PVTG" indicates removal of Precise Visual Thought Guidance for Visual Cues, and "w/o DVTI" indicates removal of Dynamic Visual Thought Integration for Adaptive Reasoning.
Figure 4: A comparison of the total token consumption between DaP-ICoT and baseline methods on the M$^3$CoT benchmark using Qwen2-VL-7B. DaP-ICoT achieves a 72.6% reduction in token consumption compared to ICoT.
Figure 5: A comparison of image insertion frequency and the number of inserted image tokens between DaP-ICoT and ICoT on the Qwen2-VL-7B model.
...and 3 more figures
