An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability

Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou

TL;DR

This study shows that Multimodal Large Language Models (MLLMs) can be steered to perceive sentiment through carefully configured In-Context Learning (ICL) demonstrations. By optimizing three factors (similarity-based demonstration retrieval, the presentation of multimodal evidence in the input, and the sentiment distribution of the demonstrations), the authors achieve average accuracy gains of 15.9% over the zero-shot paradigm and 11.2% over the random ICL baseline across six Multimodal Sentiment Analysis (MSA) datasets. They also reveal a sentimental predictive bias in MLLMs and mitigate it through distribution strategies. The work provides a practical blueprint for deploying multimodal ICL in sentiment tasks and lays groundwork for its broader adoption and refinement in real-world settings.

Abstract

Advances in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning and has emerged as a dominant trend in practical applications. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, has not benefited from this convenience. The zero-shot paradigm performs poorly on MSA, casting doubt on whether MLLMs can perceive sentiments as competently as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs indeed possess such capability. Specifically, three key factors that cover the demonstrations' retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentimental predictive bias inherent in MLLMs is also discovered and subsequently counteracted. By complementing each other, the strategies devised for the three factors yield average accuracy improvements of 15.9% on six MSA datasets over the zero-shot paradigm and 11.2% over the random ICL baseline.

Paper Structure

This paper contains 24 sections, 2 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Comparison of fully-supervised models, few-shot models, and MLLMs based on average accuracy and annotated data requirement across six MSA datasets. The MLLMs' zero-shot paradigm, although avoiding the laborious annotation, exhibits a substantial performance gap compared to fully-supervised models. With proper demonstration configuration, this gap can be notably narrowed by In-Context Learning (ICL).
  • Figure 2: Comparison between MLLMs' zero-shot paradigm and ICL. In addition to the test sample, ICL sequences three demonstrations with inputs and corresponding outputs, facilitating more precise sentiment predictions for MLLMs.
  • Figure 3: Illustration of the three factors to be investigated and optimized, during which we aim to address the following questions. (a) How do we measure the similarity score between multimodal data? (b) How do we decide which modality should be presented in the input? (c) What kind of impact does the sentiment distribution of demonstrations have?
  • Figure 4: Average accuracy across 4-, 8-, and 16-shot demonstrations retrieved based on the WIT and WITA strategies.
  • Figure 5: Evaluation of ICL's "Task Learning" effect by progressively incorporating modalities into the inputs.
  • ...and 3 more figures
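
The ICL setup described in Figure 2, combined with similarity-based retrieval, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Demo` class, the `retrieve` and `build_prompt` helpers, the prompt wording, and the toy embeddings are all hypothetical, and the sketch assumes each sample already has a precomputed embedding vector. It uses text-only inputs as a stand-in for multimodal ones and plain cosine similarity as the score.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """A labeled demonstration (hypothetical structure, for illustration only)."""
    text: str
    label: str
    embedding: list  # assumed to be precomputed by some encoder


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_embedding, pool, k):
    """Return the k demonstrations most similar to the query, most similar first."""
    ranked = sorted(
        pool,
        key=lambda d: cosine(query_embedding, d.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(demos, query_text):
    """Sequence the demonstrations (input plus output) before the unlabeled test sample."""
    blocks = [f"Text: {d.text}\nSentiment: {d.label}" for d in demos]
    blocks.append(f"Text: {query_text}\nSentiment:")
    return "\n\n".join(blocks)


# Toy pool with made-up 2-D embeddings.
pool = [
    Demo("Loved the concert tonight", "positive", [1.0, 0.0]),
    Demo("Worst service ever", "negative", [0.0, 1.0]),
    Demo("The store opens at nine", "neutral", [0.9, 0.1]),
]
selected = retrieve([1.0, 0.0], pool, k=2)
prompt = build_prompt(selected, "What a fantastic show")
print(prompt)
```

Swapping `retrieve` for a random sample of the pool would correspond to a random ICL baseline; controlling which labels appear among `selected` is where a sentiment-distribution strategy would plug in.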