CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

Zihui Cheng, Qiguang Chen, Jin Zhang, Hao Fei, Xiaocheng Feng, Wanxiang Che, Min Li, Libo Qin

TL;DR

CoMT addresses a key limitation of existing Multi-modal Chain-of-Thought (MCoT) benchmarks by requiring both multi-modal input and multi-modal reasoning output. It presents four visual-operation tasks (Visual Creation, Visual Deletion, Visual Update, and Visual Selection), built from established datasets with a standardized template, to enable rich, multi-step visual reasoning. Across a range of LVLMs and prompting strategies, the results show large gaps to human performance; in-context learning with multi-modal rationales is the most promising strategy, and performance correlates strongly with multi-modal alignment metrics such as CLIPScore. The benchmark thus highlights the importance of integrating visual generation into reasoning and provides a platform to guide future improvements in multi-modal thought processes.

Abstract

Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.

Paper Structure

This paper contains 49 sections, 2 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Comparison between (a) traditional multi-modal CoT and (b) chain of multi-modal thought, where LVLMs must generate the images in the rationale to assist the textual reasoning.
  • Figure 2: The overall annotation process for the four tasks of CoMT, which consist of (a) visual creation, (b) visual deletion, (c) visual update, and (d) visual selection.
  • Figure 3: Distribution of CoMT tasks across four types of image processing.
  • Figure 4: Analysis of the correlation between model performance and rationale quality for different LVLMs, based on ROSCOE (Golovneva et al., 2023).
  • Figure 5: CLIPScore of LVLMs on 4 tasks within CoMT. The x-axis represents the CLIPScore, and the y-axis represents the accuracy.
  • ...and 6 more figures