MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models

Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv, Ruiqi Chen, Jing Yang, Yijia Fan, Xiaofei Sun, Jian Wang, Ziliang Chen, Liang Lin, Keze Wang

Abstract

The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
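The selection task described in the abstract can be illustrated with a minimal sketch. It assumes each item pairs one image or video with several candidate event chains, exactly one of which is valid, and that every distractor is tagged with the single constraint it violates. The names `ChainLabel`, `CandidateChain`, `Item`, and `accuracy`, as well as the example item, are hypothetical and are not taken from the MM-CoT release.

```python
from dataclasses import dataclass
from enum import Enum


class ChainLabel(Enum):
    """Hypothetical labels mirroring the two constraints in the abstract."""
    VALID = "valid"                    # visually consistent and logically coherent
    VISUAL_INCONSISTENCY = "visual"    # a step is not anchored in observable evidence
    LOGICAL_INCOHERENCE = "logical"    # causal or commonsense validity is broken


@dataclass(frozen=True)
class CandidateChain:
    steps: tuple[str, ...]
    label: ChainLabel


@dataclass(frozen=True)
class Item:
    media_path: str                    # assumed: path to the image or video
    candidates: tuple[CandidateChain, ...]

    def __post_init__(self):
        # The task requires a single correct chain per item.
        n_valid = sum(c.label is ChainLabel.VALID for c in self.candidates)
        if n_valid != 1:
            raise ValueError("expected exactly one valid chain per item")

    def correct_index(self) -> int:
        return next(i for i, c in enumerate(self.candidates)
                    if c.label is ChainLabel.VALID)


def accuracy(items, predictions):
    """Fraction of items where the selected index is the valid chain."""
    if len(items) != len(predictions):
        raise ValueError("one prediction is required per item")
    hits = sum(item.correct_index() == p for item, p in zip(items, predictions))
    return hits / len(items)


# Illustrative item only; the content is invented for this sketch.
example = Item(
    media_path="example.jpg",
    candidates=(
        CandidateChain(("glass tips over", "water spills", "floor gets wet"),
                       ChainLabel.VALID),
        CandidateChain(("bottle tips over", "water spills", "floor gets wet"),
                       ChainLabel.VISUAL_INCONSISTENCY),
        CandidateChain(("floor gets wet", "water spills", "glass tips over"),
                       ChainLabel.LOGICAL_INCOHERENCE),
    ),
)
```

Because each distractor carries the constraint it breaks, a wrong selection can be attributed to a grounding failure or a reasoning failure rather than counted only as an error.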

Paper Structure

This paper contains 43 sections, 2 equations, 5 figures, 11 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of MM-CoT. Given an image or video, the model must select the only event chain that is both visually grounded and logically coherent, while rejecting distractor chains containing visual inconsistencies (e.g., altered key elements) or causal/temporal violations.
  • Figure 2: Qualitative visualization of MM-CoT samples. Each instance contains multiple reasoning chains associated with an image or video input. Only one chain (highlighted in blue) satisfies both visual grounding and logical coherence, while others introduce visually inconsistent (red) or logically incoherent (orange) distractors. This design enables detailed analysis of reasoning failure modes across modalities.
  • Figure 3: Qualitative visualization of MM-CoT samples (video modality). Given a video clip, MM-CoT provides several candidate event chains describing how the scene evolves. Exactly one chain (in blue) is both visually grounded and causally coherent, while the others inject visual inconsistencies (red) or causal/temporal violations (orange). This setting probes whether models can verify multi-step reasoning across time, not merely describe the scene.
  • Figure 4: Dominant error categories for Qwen2.5-VL-72B and GPT-5 on image-based tasks.
  • Figure 5: Dominant error categories for Qwen2.5-VL-72B and GPT-5 on video-based tasks.
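An error breakdown like the one summarized in Figures 4 and 5 could be tallied roughly as follows. This is a sketch under assumptions: the captions here do not list the paper's actual error categories, so the code uses only the two coarse distractor types named in Figures 1 to 3 (visual inconsistency and causal/temporal violation), and the function name `error_breakdown` and the string labels are hypothetical.

```python
from collections import Counter


def error_breakdown(chosen_labels):
    """Share of each distractor type among a model's wrong selections.

    chosen_labels holds, for each item, the label of the chain the model
    picked: "valid", "visual", or "logical" (assumed label names).
    """
    errors = Counter(label for label in chosen_labels if label != "valid")
    total = sum(errors.values())
    if total == 0:
        return {}  # the model made no errors
    return {label: count / total for label, count in errors.items()}
```

Running this separately on image and video items would give one breakdown per modality, which is the comparison the two figures draw.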