Multimodal LLMs See Sentiment

Neemias B. da Silva, John Harrison, Rodrigo Minetto, Myriam R. Delgado, Bogdan T. Nassu, Thiago H. Silva

TL;DR

This work tackles visual sentiment analysis by leveraging Multimodal Large Language Models (MLLMs) in a three-path framework: direct image sentiment classification with MLLMs, caption-based sentiment via pre-trained LLMs on image descriptions, and fine-tuned LLM sentiment on those descriptions. The proposed MLLMsent system demonstrates state-of-the-art performance on PerceptSent, with strong cross-dataset generalization to DeepSent, and the largest gains arise from fine-tuning the LLMs. The approach emphasizes interpretability through generated textual descriptions that explain predictions, offering a transparent pathway from visual input to sentiment labels. Overall, the study highlights the effectiveness of multimodal reasoning for affective computing on social-media data and provides benchmarks and insights for future research.

Abstract

Understanding how visual content communicates sentiment is critical in an era where online interaction on social platforms is increasingly dominated by this kind of media. However, this remains a challenging problem, as sentiment perception is closely tied to complex, scene-level semantics. In this paper, we propose an original framework, MLLMsent, to investigate the sentiment reasoning capabilities of Multimodal Large Language Models (MLLMs) from three perspectives: (1) using those MLLMs for direct sentiment classification from images; (2) associating them with pre-trained LLMs for sentiment analysis on automatically generated image descriptions; and (3) fine-tuning the LLMs on sentiment-labeled image descriptions. Experiments on a recent and established benchmark demonstrate that our proposal, particularly the fine-tuned approach, achieves state-of-the-art results, outperforming Lexicon-, CNN-, and Transformer-based baselines by up to 30.9%, 64.8%, and 42.4%, respectively, across different levels of evaluators' agreement and sentiment polarity categories. Remarkably, in a cross-dataset test, without any training on the new data, our model still outperforms, by up to 8.26%, the best runner-up, which was trained directly on that data. These results highlight the potential of the proposed visual reasoning scheme for advancing affective computing, while also establishing new benchmarks for future research.
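
The three perspectives listed in the abstract can be pictured as two inference routes: an MLLM that labels the image directly, and an MLLM that describes the image so that a text-only LLM (pre-trained or fine-tuned) labels the description. The Python sketch below is only an illustration of that control flow, not the authors' implementation. The function names, the three-way label set, and the toy stand-in models are all assumptions introduced here; a real system would replace the stand-ins with actual model calls.

```python
from typing import Callable, Tuple

# Assumed label set for illustration; the paper evaluates several polarity setups.
LABELS = ("negative", "neutral", "positive")


def classify_direct(image: bytes, mllm_classify: Callable[[bytes], str]) -> str:
    """Perspective 1: the MLLM predicts the sentiment label straight from the image."""
    label = mllm_classify(image)
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label


def classify_via_description(
    image: bytes,
    mllm_describe: Callable[[bytes], str],
    llm_classify: Callable[[str], str],
) -> Tuple[str, str]:
    """Perspectives 2 and 3: the MLLM writes a description, a text LLM labels it.

    Passing a pre-trained classifier corresponds to perspective 2; passing a
    classifier fine-tuned on labeled descriptions corresponds to perspective 3.
    The description is returned too, since it serves as the explanation.
    """
    description = mllm_describe(image)
    label = llm_classify(description)
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return description, label


# Hypothetical stand-ins so the sketch runs without any model.
def toy_describe(image: bytes) -> str:
    return "A smiling child holding a balloon at a sunny park."


def toy_text_classifier(text: str) -> str:
    lowered = text.lower()
    if any(word in lowered for word in ("smiling", "sunny")):
        return "positive"
    if any(word in lowered for word in ("crying", "wreck")):
        return "negative"
    return "neutral"


if __name__ == "__main__":
    description, label = classify_via_description(b"", toy_describe, toy_text_classifier)
    print(description, "->", label)
```

Returning the description alongside the label mirrors the interpretability argument in the TL;DR: the generated text is what a reader would inspect to understand a prediction.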

Paper Structure

This paper contains 14 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Image samples and associated data from (a) PerceptSent and (b) DeepSent datasets, with the corresponding sentiment polarity vector ($\mathbf{s}$) assigned by human evaluators. Image source: PerceptSent, [you2015robust].
  • Figure 2: Architecture diagram of MLLMsent, our proposed Multimodal Large Language Model framework for sentiment analysis.
  • Figure 3: Evaluators' votes and ground-truth labels for selected images from the PerceptSent dataset, along with classification outcomes from MLLMs prompted for direct sentiment prediction (Task 1) under distinct problem setups and dominance thresholds.
  • Figure 4: Visual reasoning for PerceptSent images: MLLM descriptions (refer to Fig. 3 for evaluator votes and target sentiment labels).
  • Figure 5: Qualitative comparison of sentiment predictions on PerceptSent image examples considering different combinations of MLLMs and LLMs for Task 2a and Task 2b, under different problem setups and levels of evaluators' agreement.
  • ...and 4 more figures