MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Yu Liu, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang

TL;DR

MME-CC is a vision-grounded benchmark that organizes 11 subtasks into spatial, geometric, and visual knowledge reasoning to diagnose cognitive capacity in multimodal LLMs. A four-stage data-construction pipeline with human-in-the-loop quality control yields 1,173 questions, and 16 models are evaluated on them using an LLM-based judge and manual cross-checks. Closed-source models outperform open-source models overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain weak (≤ 30%). The analysis also reveals systematic error patterns and Chain-of-Thought dynamics that hinge on visual extraction. The work informs evaluation and model design by highlighting the central role of visual cognition, reducing textual shortcuts, and offering diagnostics to guide future training and architectures toward cognitively grounded visual reasoning.

Abstract

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.

Paper Structure

This paper contains 77 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Task taxonomy of MME-CC. Three major task categories are defined—Spatial Reasoning, Geometric Reasoning, and Visual Knowledge Reasoning—each with representative subtasks and one illustrative input example.
  • Figure 2: Overview of the data construction and quality control pipeline. The pipeline comprises four stages: (1) Task definition and preliminary evaluation — subtasks are defined with clear objectives, and small-scale pilots are conducted to validate prompt design and calibrate difficulty; (2) Data acquisition and manual verification — images from license-compliant sources are annotated and cross-checked to ensure quality; (3) Post-processing — standardized procedures (e.g., cropping, resolution checks, identifier assignment) are applied to unify formatting; (4) Model-based filtering — items that are overly simple, redundant, or ambiguous are removed based on MLLM performance, and the remaining samples form the final benchmark.
  • Figure 3: Detailed CoT analysis of Doubao-Seed-1.6-vision-0815 on the Satellite Image Matching task. The analysis reveals three key findings: (1) hierarchical reasoning with distinct phases, (2) continuous and task-dependent visual extraction, and (3) frequent self-interruptions that reduce reasoning efficiency.
  • Figure 4: Representative error cases of Doubao-Seed-1.6-vision-0815.
  • Figure 5: Error case in Satellite Image Matching.
  • ...and 9 more figures
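
Stage 4 of the pipeline in Figure 2 (model-based filtering) could be sketched roughly as follows. This is only an illustration, not the authors' implementation: the `Item` record, its `pilot_correct` field, the `max_pass_rate` threshold, and the text-normalization rule for detecting redundancy are all hypothetical. Ambiguity is not handled here; the sketch assumes it is left to human review.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical benchmark item; field names are assumptions."""
    item_id: str
    question: str
    # One correctness flag per pilot MLLM that attempted the item.
    pilot_correct: Tuple[bool, ...]


def filter_items(items: List[Item], max_pass_rate: float = 0.8) -> List[Item]:
    """Drop overly simple and redundant items.

    An item counts as overly simple when the fraction of pilot models
    answering it correctly exceeds max_pass_rate (an assumed threshold).
    An item counts as redundant when its question text, lowercased with
    whitespace collapsed, has already been kept.
    """
    kept: List[Item] = []
    seen_questions = set()
    for item in items:
        if not item.pilot_correct:
            raise ValueError(f"item {item.item_id} has no pilot results")
        pass_rate = sum(item.pilot_correct) / len(item.pilot_correct)
        if pass_rate > max_pass_rate:
            continue  # too easy for the pilot models
        key = " ".join(item.question.lower().split())
        if key in seen_questions:
            continue  # near-verbatim repeat of a kept question
        seen_questions.add(key)
        kept.append(item)
    return kept
```

In practice the threshold and the redundancy check would need to be calibrated against the pilot results; the values above are placeholders.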