Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study

Yujie Lin, Ante Wang, Moye Chen, Jingyao Liu, Hao Liu, Jinsong Su, Xinyan Xiao

TL;DR

This study extends chain-of-thought reasoning to multi-modal inputs and evaluates inference-time scaling using sampling-based (Self-Consistency, Best-of-N) and tree-search-based (Beam Search, MCTS) methods, guided by a consistency-enhanced verifier. Across 10 challenging tasks spanning geometry, mathematics, and visual question answering, multi-modal thought consistently outperforms text-only thinking and shows a higher potential upper bound, albeit at substantially higher token costs. An ablation study demonstrates that the consistency-enhanced verifier provides more reliable guidance and that its effectiveness scales with the number of verifications, while tree-search methods reduce intermediate errors more effectively than sampling-based approaches. The work highlights the promise and challenges of multi-modal inference-time scaling, suggesting future directions in token-efficient visual processing, stronger verifiers, and extending the paradigm to other modalities.

Abstract

Recently, inference-time scaling of chain-of-thought (CoT) has been demonstrated as a promising approach for addressing multi-modal reasoning tasks. While existing studies have predominantly centered on text-based thinking, the integration of both visual and textual modalities within the reasoning process remains unexplored. In this study, we pioneer the exploration of inference-time scaling with multi-modal thought, aiming to bridge this gap. To provide a comprehensive analysis, we systematically investigate popular sampling-based and tree search-based inference-time scaling methods on 10 challenging tasks spanning various domains. In addition, we uniformly adopt a consistency-enhanced verifier to ensure effective guidance for both kinds of methods across different thought paradigms. Results show that multi-modal thought outperforms conventional text-only thought, and that blending the two types of thought fosters more diverse thinking. Despite these advantages, multi-modal thought requires higher token consumption to process richer visual inputs, which raises concerns for practical applications. We hope that our findings on the merits and drawbacks of this research line will inspire future work in the field.

Paper Structure

This paper contains 45 sections, 7 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Solving the problem using text-only thought vs. multi-modal thought. Additional visual information from the latter offers richer and more intuitive features, making it easier to yield better subsequent steps.
  • Figure 2: Three types of verifiers investigated in this work, where the classification-based verifier outputs sparse binary scores (0 or 1) and the regression-based verifier provides dense but inaccurate scores. We introduce the consistency-enhanced verifier to compute dense and accurate scores by aggregating multiple evaluations sampled from the classification-based verifier.
  • Figure 3: Performance comparison between text-only thought and multi-modal thought on the Maxflow dataset under Best-of-N (left) and MCTS (right).
  • Figure 4: Performance comparison between text-only thought and multi-modal thought as the maximum token consumption varies, on the Maxflow dataset under Self-Consistency.
  • Figure 5: Performance of Self-Consistency on the Maxflow dataset as the number of samples increases, comparing text-only thought, multi-modal thought, and the hybrid form.
  • ...and 3 more figures
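
The Figure 2 caption describes the consistency-enhanced verifier as aggregating multiple evaluations sampled from a classification-based verifier into a dense score. The sketch below illustrates that idea together with the two sampling-based selection rules named in the TL;DR. It is not the authors' implementation: the function names, the `binary_verifier` callable, and the choice of a simple mean as the aggregation rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_score(
    binary_verifier: Callable[[str], int],
    candidate: str,
    num_verifications: int,
) -> float:
    """Turn sparse 0/1 verdicts into a dense score in [0, 1].

    Assumption: the aggregation is a plain mean over repeated verifier
    calls; the summary above only says the evaluations are "aggregated".
    """
    if num_verifications <= 0:
        raise ValueError("num_verifications must be positive")
    # Each call stands in for one sampled evaluation of the candidate.
    votes = [binary_verifier(candidate) for _ in range(num_verifications)]
    return sum(votes) / num_verifications


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Best-of-N: keep the candidate with the highest verifier score."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=score_fn)


def self_consistency(answers: Sequence[str]) -> str:
    """Self-Consistency: majority vote over sampled final answers."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]
```

In this sketch, raising `num_verifications` makes the score finer-grained (it can take `num_verifications + 1` distinct values), which is one plausible reading of why the TL;DR reports that the verifier's effectiveness scales with the number of verifications.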