MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps

Xiongtao Zhou, Jie He, Lanyu Chen, Jingyu Li, Haojing Chen, Víctor Gutiérrez-Basulto, Jeff Z. Pan, Hanjie Chen

TL;DR

MiCEval addresses the lack of automated, fine-grained evaluation of multimodal chain of thought (MCoT) by decomposing MCoT outputs into image-description steps and reasoning steps and by introducing a multi-task evaluation framework that operates at both the step level and the MCoT level. It builds a fine-grained, human-annotated dataset (903 valid MCoTs, 2,889 steps) and defines multimodal correctness metrics that combine description quality and reasoning quality via geometric means, so that MLLMs can be used to verify and evaluate MCoTs. In two experimental tracks, with MLLMs as verifiers and as evaluators, MiCEval aligns more closely with human judgments than cosine-similarity baselines or fine-tuning approaches, while exposing persistent gaps in current MLLMs’ visual and complex reasoning capabilities. The framework supports automated assessment of MCoT outputs and downstream filtering of high-quality reasoning in multimodal systems; code and data are available at the MiCEval GitHub repository.
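
To make the idea of combining description and reasoning quality via geometric means concrete, here is a minimal sketch. The exact formula used by MiCEval is not given in this summary, so the function names (`geometric_mean`, `mcot_correctness`), the assumption that per-step scores lie in [0, 1], and the two-level aggregation are all illustrative assumptions rather than the authors' definition.

```python
import math
from typing import Sequence


def geometric_mean(scores: Sequence[float]) -> float:
    """Geometric mean of scores in [0, 1]; any zero score yields 0.0."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores are assumed to lie in [0, 1]")
    if any(s == 0.0 for s in scores):
        return 0.0
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def mcot_correctness(
    description_scores: Sequence[float],
    reasoning_scores: Sequence[float],
) -> float:
    """Hypothetical MCoT-level score (an assumption, not the paper's formula).

    Aggregates each step type with a geometric mean, then combines the two
    aggregates with another geometric mean.
    """
    description_quality = geometric_mean(description_scores)
    reasoning_quality = geometric_mean(reasoning_scores)
    return geometric_mean([description_quality, reasoning_quality])
```

One property of this design choice is that a geometric mean penalizes a single bad step much more strongly than an arithmetic mean does: under the assumptions above, one step scored 0 drives the whole MCoT score to 0.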

Abstract

Multimodal Chain of Thought (MCoT) is a popular prompting strategy for improving the performance of multimodal large language models (MLLMs) across a range of complex reasoning tasks. Despite its popularity, there is a notable absence of automated methods for evaluating the quality of reasoning steps in MCoT. To address this gap, we propose Multimodal Chain-of-Thought Evaluation (MiCEval), a framework designed to assess the correctness of reasoning chains by evaluating the quality of both the description and each reasoning step. The evaluation of the description component focuses on the accuracy of the image descriptions, while the evaluation of the reasoning component assesses the quality of each step as it is conditionally generated based on the preceding steps. MiCEval is built upon a fine-grained dataset with annotations that rate each step according to correctness, relevance, and informativeness. Extensive experiments on four state-of-the-art MLLMs show that step-wise evaluations using MiCEval align more closely with human judgments than existing methods based on cosine similarity or fine-tuning approaches. MiCEval datasets and code are available at https://github.com/alenai97/MiCEval.
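
The step-wise evaluation described in the abstract can be sketched as a loop that scores each step in order. This is an illustrative sketch only: the `Step` type, the `Scorer` signature, and the choice to give description steps an empty context are assumptions made here, and the scorer itself (for example, a call to an MLLM) is left as a caller-supplied function.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    kind: str  # assumed labels: "description" or "reasoning"
    text: str


# Hypothetical scorer: takes the current step and its context, returns a score.
Scorer = Callable[[Step, Sequence[Step]], float]


def score_steps(steps: Sequence[Step], scorer: Scorer) -> List[float]:
    """Score each step of an MCoT in order.

    Reasoning steps are scored conditioned on all preceding steps.
    Description steps are scored with an empty context (an assumption:
    they are judged against the image rather than against earlier steps).
    """
    scores: List[float] = []
    for index, step in enumerate(steps):
        context = list(steps[:index]) if step.kind == "reasoning" else []
        scores.append(scorer(step, context))
    return scores
```

In practice the scorer would also need access to the image and the question; those inputs are omitted here to keep the sketch self-contained.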

Paper Structure

This paper contains 45 sections, 10 equations, 30 figures, 20 tables.

Figures (30)

  • Figure 1: An example in which CLIP and ReCEval fail to choose the correct MCoT answer from two model-generated MCoT answers, whereas MiCEval succeeds.
  • Figure 2: Our work consists of two main parts: (a) sampling questions from the source datasets, generating MCoT answers using four MLLMs, followed by high-quality human annotation and filtering to create the MiCEval dataset; (b) a detailed illustration of our MiCEval framework.
  • Figure 3: A complete flowchart of the MCoT annotation process. We first determine the type of each step and then annotate its correctness based on the type of step. Once all steps in an MCoT answer are annotated, we evaluate the correctness of the entire MCoT.
  • Figure 4: The relationship between the average accuracy of three MLLMs across all Pairwise Comparison tasks and the number of shots on two splits.
  • Figure 5: The distribution of MCoT generators in each split.
  • ...and 25 more figures