An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability
Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
TL;DR
This study shows that Multimodal Large Language Models (MLLMs) can be steered to perceive sentiment through carefully configured In-Context Learning (ICL) demonstrations. The authors optimize three factors: demonstration retrieval via similarity, presentation of multimodal evidence, and sentiment distribution within the demonstrations. Combined, these strategies yield an average accuracy improvement of 15.9% over the zero-shot paradigm and 11.2% over a random ICL baseline on six Multimodal Sentiment Analysis (MSA) datasets. The study also reveals a sentimental predictive bias in MLLMs and mitigates it through distribution strategies. The work offers a practical, model-agnostic blueprint for deploying multimodal ICL in sentiment tasks and lays groundwork for refining multimodal ICL in real-world settings.
Abstract
Advances in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning and has become a dominant trend in practical applications. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, has not benefited from this convenience. The zero-shot paradigm performs poorly on MSA, casting doubt on whether MLLMs can perceive sentiments as competently as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs do possess this capability. Specifically, we comprehensively investigate and optimize three key factors covering the retrieval, presentation, and distribution of demonstrations. We also discover a sentimental predictive bias inherent in MLLMs and effectively counteract it. By complementing each other, the strategies devised for the three factors yield average accuracy improvements of 15.9% on six MSA datasets over the zero-shot paradigm and 11.2% over the random ICL baseline.
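To make the retrieval and distribution factors concrete, the following is a minimal sketch of how demonstrations might be selected for a query. It is not the authors' method: the abstract does not specify the similarity measure or the distribution strategy, so cosine similarity over precomputed embeddings and round-robin label balancing are assumptions chosen for illustration. The function name `select_demonstrations` and the `(embedding, label, example_id)` pool layout are likewise hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_demonstrations(query_vec, pool, k, balance=True):
    """Pick k demonstrations from pool for one query.

    pool: list of (embedding, label, example_id) tuples (hypothetical layout).
    balance=False: plain top-k by similarity (assumed retrieval strategy).
    balance=True: round-robin over sentiment labels, taking the most similar
    remaining example of each label (assumed distribution strategy).
    """
    # Rank the whole pool by similarity to the query, most similar first.
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    if not balance:
        return ranked[:k]

    # Group the ranked examples by label, preserving similarity order.
    by_label = defaultdict(list)
    for item in ranked:
        by_label[item[1]].append(item)

    labels = sorted(by_label)
    chosen = []
    turn = 0
    # Cycle through labels until k examples are chosen or the pool is exhausted.
    while len(chosen) < k and any(by_label[label] for label in labels):
        label = labels[turn % len(labels)]
        if by_label[label]:
            chosen.append(by_label[label].pop(0))
        turn += 1
    return chosen
```

With `balance=False` the selection can be dominated by a single sentiment when the nearest neighbours share a label; the balanced variant trades some similarity for an even label mix, which is one plausible way to probe the predictive bias the abstract mentions.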
