Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan

TL;DR

VisFactor is the first psychometrically grounded benchmark that redefines MLLM visual cognition evaluation by translating 20 FRCT subtests into a vision–language setting. The benchmark employs a controllable-difficulty generator to create unlimited items, reducing luck-based scoring and enabling scalable tracking of progress. Across 23 frontier models, the best result is only $30.17\%$, with broad failures in mental rotation, spatial reasoning, and figure–ground tasks, signaling a gap between current multimodal pretraining outcomes and human-like visuocognition. Human performance remains substantially higher (~$78.8\%$) on VisFactor, underscoring the need for curriculum-style training that grounds perception in low-level visual faculties intertwined with higher-level reasoning, rather than solely optimizing downstream tasks.

Abstract

Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstream tasks, often bypassing these foundational visual capabilities. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from FRCT, a well-established cognitive psychology assessment spanning four domains of human visual cognition. Furthermore, we design algorithms to automatically construct and validate unlimited test cases with controllable difficulty. Using VisFactor, we evaluate 23 frontier MLLMs, including both proprietary (e.g., GPT, Gemini) and open-source (e.g., LLaMA, Qwen) models. The best model achieves a score of only 30.17%. Models consistently fail on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that performance improvements on existing general benchmarks might represent castles in the air instead of a genuine mastery of human-like visual cognition.

Paper Structure

This paper contains 61 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: VisFactor comprises 20 vision-centric cognitive subtests. Each task is designed to isolate core factors of human visual cognition, covering 10 distinct factors in total. The subtests are converted into either yes/no questions or fill-in-the-blank questions according to §\ref{sec:variants}. Example stimuli, questions, and ground-truth answers are shown for each task.
  • Figure 2: Samples of our generated images. We can dynamically adjust test difficulties in VisFactor. For example, the grid size of CF3 is changed to $6 \times 6$ instead of $5 \times 5$.
  • Figure 3: An example of our generated MA1 image-number pairs using CF2 and MV1 figures.
  • Figure 4: VisFactor integrates 20 subtests adapted from standardized human cognitive assessments. Subtests are organized into four major domains and weighted by test case count (shown numerically), which determines each segment's visual area.
  • Figure 5: Pearson correlation between all subtests in VisFactor.