Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

Yinghui Li, Jiayi Kuang, Peng Xing, Daixian Liu, Junnan Dong, Shu-Yu Guo, Yangning Li, Qingyu Zhou, Wenhao Jiang, Hai-Tao Zheng, Ying Shen, Liang Lin, Philip S. Yu

Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols -- the fundamental building blocks of human cognition -- remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.

Paper Structure (50 sections, 10 equations, 14 figures, 1 table)

Figures (14)

  • Figure 1: (a) The difference between continuous and discrete semantic spaces. (b) The visual mechanism of the human eye. (c) The neural response mechanism of the human brain to different levels of visual symbols.
  • Figure 2: Overview of the benchmark task design framework and illustrative examples. (a) The instantiation of the three-level symbolic understanding hierarchy across five distinct domains: Language, Cultural, Mathematical, Physical, and Chemical symbols. (b) Representative examples of the tasks designed for our benchmark.
  • Figure 3: (a) Radar charts illustrate the fine-grained performance capabilities of models across General, Language, Culture, Math, Physics, and Chemistry domains. (b) Global performance aggregated by difficulty, with accuracy averaged across all five symbolic domains. (c) Scatter plots exploring the interrelationships between different domains.
  • Figure 4: Fine-grained performance analysis of MLLMs on language symbol tasks across difficulty levels. (a-b) Performance breakdown for Level 1 and Level 2 tasks, reporting F1-score and Prediction Count. (c) Bidirectional plot for Level 3 tasks, contrasting Exact Match accuracy (left, $\uparrow$) with Edit Distance error rates (right, $\downarrow$). (d) A normalized heatmap summary of all metrics, where darker orange indicates superior performance. (e) Case illustrations of language symbols.
  • Figure 5: Hierarchical evaluation of MLLMs on cultural symbol tasks. (a) A vertical bar chart showing F1-scores at the word level. (b) A vertical bar chart showing F1-scores at the sentence level. (c) A paired grouped bar chart shows performance on 4-word (left) and multi-word (right) idioms, with bars color-coded by metric granularity (Word / Char-2 / Char-1). (d) Radial summary: A radial stacked chart aggregates key metrics across all levels, where the radius represents total capability. (e) Case illustrations of cultural symbols.
  • ...and 9 more figures