Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study
Yujie Lin, Ante Wang, Moye Chen, Jingyao Liu, Hao Liu, Jinsong Su, Xinyan Xiao
TL;DR
This study extends chain-of-thought reasoning to multi-modal inputs and evaluates inference-time scaling using sampling-based (Self-Consistency, Best-of-N) and tree-search-based (Beam Search, MCTS) methods, guided by a consistency-enhanced verifier. Across 10 challenging tasks spanning geometry, mathematics, and visual question answering, multi-modal thought consistently outperforms text-only thinking and shows a higher potential upper bound, albeit at substantially higher token costs. An ablation study demonstrates that the consistency-enhanced verifier provides more reliable guidance and that its effectiveness scales with the number of verifications, while tree-search methods reduce intermediate errors more effectively than sampling-based approaches. The work highlights the promise and challenges of multi-modal inference-time scaling, suggesting future directions in token-efficient visual processing, stronger verifiers, and extending the paradigm to other modalities.
Abstract
Recently, inference-time scaling of chain-of-thought (CoT) has been demonstrated as a promising approach for addressing multi-modal reasoning tasks. While existing studies have predominantly centered on text-based thinking, the integration of both visual and textual modalities within the reasoning process remains unexplored. In this study, we pioneer the exploration of inference-time scaling with multi-modal thought, aiming to bridge this gap. To provide a comprehensive analysis, we systematically investigate popular sampling-based and tree-search-based inference-time scaling methods on 10 challenging tasks spanning various domains. In addition, we uniformly adopt a consistency-enhanced verifier to ensure effective guidance for both families of methods across different thought paradigms. Results show that multi-modal thought yields better performance than conventional text-only thought, and that blending the two types of thought fosters more diverse thinking. Despite these advantages, multi-modal thought requires higher token consumption to process richer visual inputs, which raises concerns for practical applications. We hope that our findings on the merits and drawbacks of this research line will inspire future work in the field.
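
To make the two sampling-based methods named above concrete, the following is a minimal sketch rather than the implementation used in this study. Self-Consistency is shown as a majority vote over final answers, and Best-of-N as selection by verifier score. The averaging of k repeated verifier calls is an assumption made here to illustrate how a verifier could be made more consistent; the function names (`self_consistency`, `averaged_score`, `best_of_n`) and the `verify` callable are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among independently sampled chains."""
    return Counter(answers).most_common(1)[0][0]


def averaged_score(verify: Callable[[str], float], chain: str, k: int) -> float:
    """Score one chain by averaging k verifier calls.

    Averaging is an assumed aggregation rule, not one taken from the paper.
    k must be at least 1.
    """
    return mean(verify(chain) for _ in range(k))


def best_of_n(chains: Sequence[str], verify: Callable[[str], float], k: int = 4) -> str:
    """Return the sampled chain with the highest averaged verifier score."""
    return max(chains, key=lambda chain: averaged_score(verify, chain, k))
```

In this sketch a chain is a plain string; for multi-modal thought each chain would also carry intermediate images, which is where the higher token cost noted above would arise.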
