A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

Jiaqi Wang; Hanqi Jiang; Yiheng Liu; Chong Ma; Xu Zhang; Yi Pan; Mengyuan Liu; Peiran Gu; Sichen Xia; Wenjun Li; Yutong Zhang; Zihao Wu; Zhengliang Liu; Tianyang Zhong; Bao Ge; Tuo Zhang; Ning Qiang; Xintao Hu; Xi Jiang; Xin Zhang; Wei Zhang; Dinggang Shen; Tianming Liu; Shu Zhang

TL;DR

This review addresses the need to unify processing of text, image, video, audio, and sequential data in AI systems. It systematically surveys MLLMs, covering their architectures, fusion strategies, and task performance. It offers a comparative analysis across image, video, and audio tasks and identifies the strengths and limitations of current models. It proposes future research directions focused on interpretability, efficiency, specialization versus generality, security, and ethical use.

Abstract

In an era defined by the explosive growth of data and rapid technological advancements, Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence (AI) systems. Designed to seamlessly integrate diverse data types, including text, images, videos, audio, and physiological sequences, MLLMs address the complexities of real-world applications far beyond the capabilities of single-modality systems. In this paper, we systematically review the applications of MLLMs in multimodal tasks spanning natural language, vision, and audio. We also provide a comparative analysis of the focus of different MLLMs across these tasks, offer insights into the shortcomings of current MLLMs, and suggest potential directions for future research. Through these discussions, we hope to provide valuable insights for the further development and application of MLLMs.

Paper Structure (32 sections, 5 figures)

Introduction
Overview of Multimodal Large Language Models
Definitions and Basic Concepts
Main Components of Multimodal Large Language Models
Multimodal Input Encoder
Feature Fusion Mechanism
Overview of Multimodal Feature in LLMs
Task Classification of Multimodal Large Language Models
Image Tasks
Image Understanding
Image Understanding Based on Traditional Feature Extraction Methods
Application of Deep Learning Technologies in Image Understanding
Multimodal Image Understanding and Cross-Modal Learning
Application of Reinforcement Learning in Image Understanding
Integration of Image Generation and Understanding
...and 17 more sections

Figures (5)

Figure 1: A timeline of representative MLLMs.
Figure 2: Summary of MLLMs on Image Tasks.
Figure 3: Summary of MLLMs on Video Understanding.
Figure 4: Summary of MLLMs on Video Generation.
Figure 5: Summary of MLLMs on Audio Tasks.
