Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

Jen-Tse Huang; Dasen Dai; Jen-Yuan Huang; Youliang Yuan; Xiaoyuan Liu; Wenxuan Wang; Wenxiang Jiao; Pinjia He; Zhaopeng Tu; Haodong Duan

TL;DR

VisFactor is the first psychometrically grounded benchmark for evaluating visual cognition in MLLMs, built by translating 20 FRCT subtests into a vision–language setting. A generator with controllable difficulty produces an unlimited number of test items, which limits the effect of lucky guesses on scores and allows progress to be tracked at scale. Across 23 frontier models, the best score is only $30.17\%$, while human performance on VisFactor is substantially higher (about $78.8\%$). Models fail broadly on mental rotation, spatial reasoning, and figure–ground tasks, indicating a gap between what current multimodal pretraining produces and human-like visual cognition. The results point to a need for curriculum-style training that grounds perception in low-level visual faculties and ties them to higher-level reasoning, rather than optimizing only for downstream tasks.

Abstract

Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstream tasks, often bypassing these foundational visual capabilities. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from FRCT, a well-established cognitive psychology assessment spanning four domains of human visual cognition. Furthermore, we design algorithms to automatically construct and validate unlimited test cases with controllable difficulty. Using VisFactor, we evaluate 23 frontier MLLMs, including both proprietary (e.g., GPT, Gemini) and open-source (e.g., LLaMA, Qwen) models. The best model achieves a score of only 30.17%. Models consistently fail on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that performance improvements on existing general benchmarks might represent castles in the air instead of a genuine mastery of human-like visual cognition.
