MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Kaiyuan Zhang; Chenghao Yang; Zhoufutu Wen; Sihang Yuan; Qiuyue Wang; Chaoyi Huang; Guosheng Zhu; He Wang; Huawenyu Lu; Jianing Wen; Jianpeng Jiao; Lishu Luo; Longxiang Liu; Sijin Wu; Xiaolei Zhu; Xuanliang Zhang; Yu Liu; Ge Zhang; Yi Lin; Guang Shi; Chaoyou Fu; Wenhao Huang

TL;DR

MME-CC is a vision-grounded benchmark that organizes 11 subtasks into three categories (spatial, geometric, and visual knowledge reasoning) to diagnose the cognitive capacity of multimodal LLMs. A data-construction pipeline with human-in-the-loop quality control yields 1,173 questions, on which 16 models are evaluated using an LLM-based judge with manual cross-checks. Closed-source models outperform open-source ones overall (e.g., Gemini-2.5-Pro at 42.66 vs. GLM-4.5V at 30.45), while spatial and geometric reasoning remain weak ($\leq 30\%$). The analysis also reveals systematic error patterns and shows that Chain-of-Thought behavior hinges on visual extraction. The work highlights the central role of visual cognition, reduces reliance on textual shortcuts, and offers diagnostics to guide future model training and architectures toward cognitively grounded visual reasoning.

Abstract

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.
