A Comprehensive Review of Visual-Textual Sentiment Analysis from Social Media Networks
Israa Khalaf Salman Al-Tameemi, Mohammad-Reza Feizi-Derakhshi, Saeed Pashazadeh, Mohammad Asadpour
TL;DR
This survey addresses the shift from text-only sentiment analysis to multimodal sentiment analysis on social media by focusing on the fusion of visual and textual data. It systematically reviews preprocessing, feature extraction, fusion strategies (rule-based, classification-based, attention-based, and bilinear pooling), and classifier approaches across textual, visual, and joint modalities, with attention to benchmark datasets and evaluation measures. The paper also discusses the main challenges of multimodal SA, including cross-modal heterogeneity, incomplete modalities, and data scarcity, and highlights a broad range of applications from finance to healthcare. Overall, it underscores that multimodal SA can surpass unimodal approaches by leveraging complementary visual and textual cues, while outlining practical directions for future research and cross-disciplinary collaboration.
Abstract
Social media networks have become a significant aspect of people's lives, serving as a platform for their ideas, opinions and emotions. Consequently, automated sentiment analysis (SA) is critical for recognising people's feelings in ways that other information sources cannot. Analysing these feelings supports various applications, including brand evaluations, YouTube film reviews and healthcare applications. As social media continues to develop, people post a massive amount of information in different forms, including text, photos, audio and video. Traditional SA algorithms, which rely on text alone, are therefore limited, as they do not consider the expressiveness of other modalities. By combining characteristics from these different sources, multimodal data streams provide new opportunities for improving results beyond text-based SA. Our study focuses on the forefront field of multimodal SA, which examines visual and textual data posted on social media networks, since many people use these two forms of content to express themselves on these platforms. To serve as a resource for academics in this rapidly growing field, we present a comprehensive overview of textual and visual SA, including data pre-processing, feature extraction techniques, sentiment benchmark datasets, and the efficacy of multiple classification methodologies suited to each field. We also provide a brief introduction to the most frequently utilised data fusion strategies and a summary of existing research on visual-textual SA. Finally, we highlight the most significant challenges and investigate several important sentiment applications.
