Mind with Eyes: from Language Reasoning to Multimodal Reasoning

Zhiyu Lin; Yifei Gao; Xian Zhao; Yunfan Yang; Jitao Sang

TL;DR

This survey frames multimodal reasoning as a progression from language-centric reasoning to collaborative multimodal reasoning. At the first level it covers one-pass and active visual perception; at the second, action generation and state update within a multimodal context. It reviews datasets, prompts, data construction, training paradigms (SFT, RL), and benchmarks spanning general, academic, spatial, and logical tasks, along with metrics such as Accuracy, Stability, and Efficiency. It discusses challenges in cross-modal alignment and dynamic interaction, as well as the limitations of current MLLMs' visual generation, and outlines future directions toward omnimodal reasoning and multimodal agents with grounded foundation models. The work aims to guide the development of unified multimodal understanding and generation, enabling deeper cross-modal cognition.

Abstract

Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily serves a supporting role in language reasoning. The latter involves action generation and state update within the reasoning process, enabling a more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we provide insights into future research directions from two perspectives: (i) from visual-language reasoning to omnimodal reasoning and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advancements in multimodal reasoning research.
