Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey

Hao Yang; Yanyan Zhao; Yang Wu; Shilong Wang; Tian Zheng; Hongbo Zhang; Zongyang Ma; Wanxiang Che; Bing Qin

TL;DR

This survey addresses how large language models (LLMs) and large multimodal models (LMMs) can be leveraged for text-centric multimodal sentiment analysis, focusing on image-text and audio-image-text tasks. It provides a taxonomy of tasks, datasets, and methods, including cross-modal alignment and fusion approaches, and surveys how LLMs/LMMs are used via prompting, instruction tuning, or fine-tuning. The paper reviews evaluation practices (prompt strategies and metrics) and summarizes reference results across benchmarks, highlighting practical considerations such as hallucinations and prompt sensitivity. By outlining applications and challenges, the survey points to future directions in multilingual, knowledge-augmented, and efficient multimodal sentiment analysis.

Abstract

Compared to traditional sentiment analysis, which considers only text, multimodal sentiment analysis must consider emotional signals from multiple sources simultaneously and is therefore more consistent with the way humans process sentiment in real-world scenarios. It involves processing emotional information from various sources such as natural language, images, videos, audio, and physiological signals. Although other modalities also contain diverse emotional cues, natural language usually contains richer contextual information and therefore always occupies a crucial position in multimodal sentiment analysis. The emergence of ChatGPT has opened up immense potential for applying large language models (LLMs) to text-centric multimodal tasks. However, it is still unclear how existing LLMs can better adapt to text-centric multimodal sentiment analysis tasks. This survey aims to (1) present a comprehensive review of recent research in text-centric multimodal sentiment analysis tasks, (2) examine the potential of LLMs for text-centric multimodal sentiment analysis, outlining their approaches, advantages, and limitations, (3) summarize the application scenarios of LLM-based multimodal sentiment analysis technology, and (4) explore the challenges and potential future research directions for multimodal sentiment analysis.

Paper Structure (24 sections, 19 equations, 13 figures, 4 tables)

Introduction
Background on Large Language Models
Large Language Models
Large Multimodal Models
Usage of Large Language Models
Text-Centric Multimodal Sentiment Analysis Tasks
Basic Concepts of Multimodal Sentiment Analysis
Image-Text Sentiment Analysis
Coarse-grained Level
Fine-grained Level
Audio-Image-Text Sentiment Analysis
Cross-modal Sentiment Semantic Alignment
Multimodal Sentiment Semantic Fusion
Audio-image-text Sentiment Analysis Datasets
Multimodal Sarcasm Detection
...and 9 more sections

Figures (13)

Figure 1: Organization of the review article.
Figure 2: Overview of the WisdoM framework [171] architecture for coarse-grained image-text sentiment analysis.
Figure 3: Image-text fine-grained sentiment analysis tasks.
Figure 4: Illustration of exact boundary and fuzzy boundary.
Figure 5: Overview of the DQPSA model [151] architecture.
...and 8 more figures
