Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, Libo Qin

TL;DR

This work investigates why multimodal chain-of-thought (MCoT) improves large vision-language models (LVLMs). It introduces visual thoughts (VT): intermediate, cache-like cross-modal representations. It unifies Textual-MCoT and Interleaved-MCoT through four concrete expressions (N-LANG, S-LANG, E-IMG, G-IMG) and systematically evaluates their effectiveness, efficiency, and dependencies across tasks and architectures. The study finds that visual thoughts enable deeper, more efficient reasoning by re-routing visual information through internal caches and attention pathways, with performance gains tied to the clarity and conciseness of the visual thought expressions. It also analyzes internal mechanisms, showing that visual thoughts act as a bridge that carries visual information to deeper transformer layers, and discusses limitations and broader impacts to guide future MCoT research.

Abstract

Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format; their benefit depends only on the clarity and conciseness of their expression. Furthermore, to explore visual thoughts systematically, we define four distinct forms of visual thought expressions and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying levels of MCoT improvement. Additionally, we explore the internal nature of visual thoughts, finding that they serve as intermediaries between the input image and the reasoning process, carrying visual information to deeper transformer layers and enabling more advanced visual information transmission. We hope that visual thoughts can inspire further breakthroughs in future MCoT research.

Paper Structure

This paper contains 57 sections, 9 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Comparison between (a) Textual MCoT (T-MCoT) with purely textual rationale, and (b) Interleaved MCoT (I-MCoT) with the image-text interleaved rationale. VT: visual thoughts.
  • Figure 2: Comparison of multimodal reasoning from a computer-system perspective: (a) visual thoughts as an internal visual cache versus (b) direct access to raw images as external storage.
  • Figure 3: Visual Thoughts in textual expression (a) and visual expression (b). Specifically, the textual expression includes N-LANG and S-LANG, while the visual expression includes E-IMG and G-IMG.
  • Figure 4: Effectiveness Verification for Visual Thoughts. More details are in the appendix on effectiveness verification.
  • Figure 5: The proportion of performance improvement rate across tasks.
  • ...and 7 more figures