Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, Yan Zhang, Shengchuan Zhang, Xiawu Zheng, Xing Sun, Liujuan Cao, Rongrong Ji
TL;DR
Cantor introduces a perception-decision architecture for multimodal chain-of-thought (CoT) that integrates visual input at the decision-generation stage and assigns sub-tasks to a single MLLM acting as multiple expert modules. This approach reduces hallucinations and enhances high-level reasoning without fine-tuning. On ScienceQA and MathVista, Cantor achieves state-of-the-art performance among training-free methods and surpasses several fine-tuned baselines. It works in two stages, Decision-Generation and Execution, and uses expert modules such as TextIntel Extractor, ObjectQuant Locator, VisionIQ Analyst, and ChartSense Expert to elicit high-level, context-rich reasoning and robust visual understanding. The results indicate that explicit visual context at decision time and expert-module orchestration significantly improve multimodal reasoning; ablation analyses show that each module contributes meaningful gains and that images provide richer information than captions for complex tasks.
Abstract
With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces two challenges: potential "determining hallucinations" in decision-making caused by insufficient visual information, and the limitation of low-level perception tools, which fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper explores multimodal CoT for solving intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to act as multifaceted experts that derive higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without requiring fine-tuning or ground-truth rationales. Project Page: https://ggg0919.github.io/cantor/
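
The two-stage flow described above (Decision-Generation, then Execution) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `MLLM` callable interface, the prompt wording, the JSON plan format, the one-line role descriptions, and the choice to pass the image again at synthesis time are all assumptions made for illustration; only the stage names and expert module names come from the text.

```python
import json
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: (prompt, image bytes or None) -> text reply.
MLLM = Callable[[str, Optional[bytes]], str]

# Module names are from the TL;DR; the role descriptions are assumptions.
EXPERT_MODULES = {
    "TextIntel Extractor": "Extract the text in the image that the sub-task needs.",
    "ObjectQuant Locator": "Locate and count the objects the sub-task asks about.",
    "VisionIQ Analyst": "Answer the sub-task's question about the image content.",
    "ChartSense Expert": "Read and interpret the chart or plot for the sub-task.",
}


def generate_decision(mllm: MLLM, question: str, image: bytes) -> List[Tuple[str, str]]:
    """Stage 1: the MLLM sees both image and question, then assigns sub-tasks."""
    prompt = (
        "Decide which expert modules are needed and give each one a sub-task.\n"
        f"Available modules: {', '.join(EXPERT_MODULES)}\n"
        f"Question: {question}\n"
        'Reply as a JSON list of {"module": ..., "task": ...} objects.'
    )
    plan = json.loads(mllm(prompt, image))
    # Drop any module the model invented that is not in the known set.
    return [(s["module"], s["task"]) for s in plan if s["module"] in EXPERT_MODULES]


def execute(mllm: MLLM, question: str, image: bytes, plan: List[Tuple[str, str]]) -> str:
    """Stage 2: the same MLLM plays each expert, then synthesizes a final answer."""
    findings = []
    for module, task in plan:
        role = EXPERT_MODULES[module]
        reply = mllm(f"You are {module}. {role}\nSub-task: {task}", image)
        findings.append(f"{module}: {reply}")
    synthesis_prompt = (
        f"Question: {question}\nExpert findings:\n" + "\n".join(findings)
        + "\nReason step by step and give the final answer."
    )
    return mllm(synthesis_prompt, image)  # passing the image here is an assumption


def cantor_sketch(mllm: MLLM, question: str, image: bytes) -> str:
    """Run both stages with a single MLLM, with no fine-tuning involved."""
    plan = generate_decision(mllm, question, image)
    return execute(mllm, question, image, plan)
```

The sketch reflects two points from the text. The image is supplied at decision time rather than only to downstream tools, and one model serves as every expert, so orchestration reduces to changing the prompt.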
