Core Knowledge Deficits in Multi-Modal Language Models

Yijiang Li, Qingying Gao, Tianwei Zhao, Bingyang Wang, Haoran Sun, Haiyun Lyu, Robert D. Hawkins, Nuno Vasconcelos, Tal Golan, Dezhi Luo, Hokin Deng

TL;DR

This work introduces CoreCognition, a developmentally inspired, large-scale benchmark of 12 core cognitive abilities for probing grounding in multi-modal large language models. Evaluating 230 models under 11 prompts, for 2,530 data points in total, the study reveals persistent core knowledge deficits, weak cross-ability dependencies, and limited scaling benefits for low-level tasks. It further shows that core abilities predict higher-level performance, yet reasoning and scaling do not reliably close the gap, indicating that models fail to acquire robust core knowledge. The Concept Hacking methodology demonstrates that many improvements arise from shortcut learning rather than genuine understanding, underscoring the need for cognition-grounded training and evaluation to advance robust, human-like AI systems.

Abstract

While Multi-modal Large Language Models (MLLMs) demonstrate impressive abilities in high-level perception and reasoning, their robustness in the wild remains limited, and they often fall short on tasks that are intuitive and effortless for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge: rudimentary cognitive abilities innate to humans from early childhood. To explore the core knowledge representation in MLLMs, we introduce CoreCognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts, leading to a total of 2,530 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs: they consistently underperform and show reduced, or even absent, scalability on low-level abilities relative to high-level ones. Finally, we propose Concept Hacking, a novel controlled evaluation method that reveals MLLMs fail to progress toward genuine core knowledge understanding, but instead rely on shortcut learning as they scale.
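
The data-point count follows from the evaluation grid: each of the 230 models is run under each of the 11 prompts, giving one data point per (model, prompt) pair. A minimal sketch of that arithmetic, where the model and prompt identifiers are hypothetical placeholders rather than names taken from the paper:

```python
from itertools import product

# Hypothetical placeholder identifiers; the actual model and prompt
# names used in the paper are not listed in this summary.
models = [f"model_{i}" for i in range(230)]
prompts = [f"prompt_{j}" for j in range(11)]

# One data point per (model, prompt) pair.
data_points = list(product(models, prompts))

print(len(data_points))  # 2530
```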

Paper Structure

This paper contains 42 sections, 1 equation, 23 figures, 6 tables.

Figures (23)

  • Figure 1: Left. Statistics of the CoreCognition benchmark. Right. Construction of taxonomy. Dependencies between abilities are indicated with arrows.
  • Figure 2: Examples from our CoreCognition benchmark.
  • Figure 3: Overview of the benchmark curation process. Core concept is first operationalized into prototypes, which are then instantiated by annotators as diverse data samples. These data samples finally undergo a strict quality check following our criteria.
  • Figure 4: Accuracy by concept, normalized by chance level. Evidence of core knowledge deficits, with statistical significance.
  • Figure 5: Pearson Correlations Between Core Abilities.
  • ...and 18 more figures
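
The captions for Figures 4 and 5 mention chance-normalized accuracy and Pearson correlations between abilities. The sketch below illustrates both quantities in plain Python. The normalization formula (accuracy minus chance, divided by one minus chance) is one common convention and is an assumption here, since the captions do not state which formula the paper uses. The function names and all scores are hypothetical illustrations, not results from the paper.

```python
from math import sqrt


def chance_normalized(accuracy: float, chance: float) -> float:
    """Rescale accuracy so that chance maps to 0 and a perfect score to 1.

    Assumed convention; the paper may normalize differently.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up example: a 4-way multiple-choice task has a chance level of 0.25.
normalized = chance_normalized(accuracy=0.625, chance=0.25)

# Made-up per-model scores on two hypothetical abilities.
ability_a = [0.2, 0.4, 0.6]
ability_b = [0.3, 0.5, 0.7]
r = pearson(ability_a, ability_b)
```

Under this convention a normalized score of 0 means chance-level performance, which makes concepts with different numbers of answer options comparable on one axis.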