Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, Yan Zhang, Shengchuan Zhang, Xiawu Zheng, Xing Sun, Liujuan Cao, Rongrong Ji

TL;DR

Cantor introduces a perception-decision architecture for multimodal chain-of-thought (CoT) that integrates visual input at the decision-generation stage and assigns sub-tasks to a single MLLM acting as multiple expert modules. This approach reduces hallucinations and enhances high-level reasoning without fine-tuning, demonstrated on ScienceQA and MathVista where Cantor achieves state-of-the-art performance among training-free methods and surpasses several fine-tuned baselines. By enabling modules such as TextIntel Extractor, ObjectQuant Locator, VisionIQ Analyst, and ChartSense Expert, Cantor stimulates high-level, context-rich reasoning and robust visual understanding through a two-stage process: Decision-Generation and Execution. The results indicate that explicit visual context at decision time and expert-module orchestration significantly improve multimodal reasoning, with ablation analyses showing each module contributes meaningful gains and that images provide superior information over captions for complex tasks.

Abstract

With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, this paradigm faces two challenges: potential "determining hallucinations" in decision-making caused by insufficient visual information, and the limitation of low-level perception tools, which fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to act as multifaceted experts that derive higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without requiring fine-tuning or ground-truth rationales. Project Page: https://ggg0919.github.io/cantor/
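
The two-stage flow described above (Decision-Generation with visual input, then Execution by one MLLM acting as several experts) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MLLM` callable interface, the prompt wording, the `DECIDE`/`EXPERT`/`ANSWER` tags, the "Module: sub-task" reply format, and all function names are hypothetical. Only the four module names are taken from the TL;DR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical interface (an assumption, not from the paper): any callable that
# maps a text prompt plus optional image bytes to a text reply.
MLLM = Callable[[str, Optional[bytes]], str]

# Module names come from the TL;DR; the one-line role descriptions are illustrative.
EXPERT_MODULES: Dict[str, str] = {
    "TextIntel Extractor": "extract the text in the image",
    "ObjectQuant Locator": "locate and count objects in the image",
    "VisionIQ Analyst": "answer questions about the image content",
    "ChartSense Expert": "read charts and diagrams",
}


@dataclass
class Decision:
    raw: str                  # full decision text (analysis, module choice, reasons)
    subtasks: Dict[str, str]  # expert module name -> assigned sub-task


def parse_assignments(reply: str) -> Dict[str, str]:
    """Keep only 'Module name: sub-task' lines that name a known module."""
    subtasks: Dict[str, str] = {}
    for line in reply.splitlines():
        name, sep, task = line.partition(":")
        name, task = name.strip(), task.strip()
        if sep and name in EXPERT_MODULES and task:
            subtasks[name] = task
    return subtasks


def generate_decision(mllm: MLLM, question: str, image: Optional[bytes]) -> Decision:
    """Stage 1: the decision generator sees both the image and the question."""
    modules = "\n".join(f"- {n}: {d}" for n, d in EXPERT_MODULES.items())
    prompt = (
        "DECIDE\n"
        f"Available modules:\n{modules}\n"
        f"Question: {question}\n"
        "Analyze the question, then assign sub-tasks, one 'Module: sub-task' per line."
    )
    reply = mllm(prompt, image)
    return Decision(raw=reply, subtasks=parse_assignments(reply))


def execute(mllm: MLLM, question: str, image: Optional[bytes], decision: Decision) -> str:
    """Stage 2: the same MLLM plays each selected expert, then generates the answer."""
    findings = []
    for name, task in decision.subtasks.items():
        expert_prompt = f"EXPERT\nYou are {name}: {EXPERT_MODULES[name]}.\nSub-task: {task}"
        findings.append(f"{name}: {mllm(expert_prompt, image)}")
    answer_prompt = (
        "ANSWER\nQuestion: " + question + "\nExpert findings:\n" + "\n".join(findings)
    )
    # Assumption: the answer step is text-only here; passing the image is equally possible.
    return mllm(answer_prompt, None)


def run_pipeline(mllm: MLLM, question: str, image: Optional[bytes]) -> str:
    decision = generate_decision(mllm, question, image)
    return execute(mllm, question, image, decision)


def stub_mllm(prompt: str, image: Optional[bytes]) -> str:
    """Canned replies keyed on the prompt tag, so the sketch runs without a model."""
    if prompt.startswith("DECIDE"):
        return "Analysis: a counting problem.\nObjectQuant Locator: count the apples"
    if prompt.startswith("EXPERT"):
        return "3 apples"
    return "Answer based on: " + prompt.splitlines()[-1]


if __name__ == "__main__":
    print(run_pipeline(stub_mllm, "How many apples are there?", b"fake-image-bytes"))
```

Swapping `stub_mllm` for a wrapper around a real multimodal model would be the only change needed to try the flow end to end. The key design point the sketch tries to capture is that the image is passed to the decision step as well as to the experts, rather than only to downstream tools.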

Paper Structure (24 sections, 17 figures, 5 tables)

Figures (17)

  • Figure 1: (a) Comparison of visual information in decision generation: asking GPT-3.5 (without visual context) leads to "determining hallucinations" because the image content is unclear to it. Cantor (with caption) introduces visual context through captions and does not encounter this issue. Cantor (with image) is even more precise, improving the rationality of task assignment. (b) Comparison of different visual tools: the low-level specialized perception tools used in traditional approaches obtain only basic data. A high-level general cognitive expert played by an MLLM obtains the relationships between object counts, enabling direct subsequent reasoning.
  • Figure 2: Overview of Cantor with a specific example. Cantor analyzes the image and problem through the Decision Generator, which offers a principle analysis of the question and provides module selection and reasons, as well as specific task allocation. Subsequently, the MLLM acts as various expert modules to execute the sub-tasks. Finally, Cantor synthesizes and reflects on the results through the Answer Generator, providing the final answer.
  • Figure 3: Proportions of Cantor's invocation of expert modules across three types of questions on ScienceQA.
  • Figure 4: The prompt of the Decision-Generation stage.
  • Figure 5: In-context Learning Examples on ScienceQA.
  • ...and 12 more figures