From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, Zhe Qian, Xinyu Liu, Yiwei Zhang, Junhao Wang, Hengbo Xu, Fei Luo, Xiaohua Chen, Xiaoshuai Hao, Hehan Li, Andi Zhang, Wenxuan Wang, Kaiyan Zhang, Guoli Jia, Lingling Li, Zhiwu Lu, Yang Lu, Yike Guo

TL;DR

The paper addresses the perception–cognition gap in Multimodal Large Language Models (MLLMs) by proposing the "From Perception to Cognition" framework, which separates low-level visual perception from high-level cognitive reasoning. It provides a structured taxonomy of methods targeting perception (visual encoders and alignment) and cognition (decomposition, dynamic reasoning), and analyzes bottlenecks at both layers, including hallucination. The survey covers benchmarks and applications across scientific problem solving, medicine, diagrams, video understanding, and sentiment analysis, and discusses auxiliary directions such as latent reasoning and tool-augmented reasoning. It concludes with future directions for closing the perception–cognition gap: a truly unified visual encoder, latent and generative reasoning, cross-image relation reasoning, and real-world cognitive evaluation.

Abstract

Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.

Paper Structure

This paper contains 36 sections, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Evolution of representative multimodal large language models from 2021 to 2025 organized along Perception and Cognition.
  • Figure 2: The overview of the survey structure.
  • Figure 3: Overview of the perception–cognition loop. Perceptual modules extract semantic and spatial evidence, which is aligned by a multimodal LLM. Cognition then executes a plan–observe–reason cycle with iterative verification to ground each step in visual evidence.
  • Figure 4: Left: Connector-based MLLM: The typical architecture of traditional multimodal models (e.g., LLaVA), where the connector is usually an MLP that projects visual features into the same dimensional space as text embeddings. Right: Structured embedding alignment in Ovis: The output of the visual encoder is no longer directly projected through an MLP, but is instead mapped to a visual embedding table.
  • Figure 5: An illustration of cross-modal fusion and response generation. In this figure, the prompt encoder improves the instruction encoding paradigm. The segmentation decoder enables the model to output a segmentation mask, which enhances response generation.
  • ...and 9 more figures
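
The contrast drawn in the Figure 4 caption can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the implementation of LLaVA or Ovis: all dimensions, weights, and function names (`mlp_connector`, `embedding_table_lookup`) are hypothetical, the ReLU activation is an arbitrary choice, and the softmax-weighted lookup is one assumed way of "mapping to a visual embedding table". The first function projects a visual feature directly into the text embedding dimension; the second scores the rows of a learnable table and returns their weighted average.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mlp_connector(feat, W1, W2):
    """Connector style: project the visual feature straight into text space.

    Assumption: a two-layer MLP with ReLU; real connectors may differ.
    """
    hidden = [max(0.0, h) for h in matvec(W1, feat)]
    return matvec(W2, hidden)


def embedding_table_lookup(feat, W_head, table):
    """Embedding-table style (assumed): soft lookup over table rows.

    The feature is turned into a probability distribution over table
    entries, and the output is the probability-weighted average of rows.
    """
    probs = softmax(matvec(W_head, feat))
    dim = len(table[0])
    embedding = [sum(p * row[j] for p, row in zip(probs, table)) for j in range(dim)]
    return embedding, probs


# Hypothetical toy sizes, chosen only to keep the example small.
VISION_DIM, HIDDEN_DIM, TEXT_DIM, VISUAL_VOCAB = 8, 16, 4, 32

rng = random.Random(0)
patch_feature = [rng.uniform(-1.0, 1.0) for _ in range(VISION_DIM)]

W1 = random_matrix(HIDDEN_DIM, VISION_DIM, rng)
W2 = random_matrix(TEXT_DIM, HIDDEN_DIM, rng)
W_head = random_matrix(VISUAL_VOCAB, VISION_DIM, rng)
table = random_matrix(VISUAL_VOCAB, TEXT_DIM, rng)

projected = mlp_connector(patch_feature, W1, W2)
looked_up, probs = embedding_table_lookup(patch_feature, W_head, table)

# Both paths end in a vector of the text embedding dimension.
print(len(projected), len(looked_up))
```

In this sketch both paths produce a vector of the same size, but the second is constrained to be a convex combination of table rows, which mirrors how text tokens are drawn from an embedding table rather than produced by a free projection.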