Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales
Jun Cao, Jiyi Li, Ziwei Yang, Renjie Zhou
TL;DR
The paper tackles Multimodal Aspect-Based Sentiment Analysis (MABSA), a fine-grained multimodal sentiment task, by addressing the limited cross-modal understanding of small language models (SLMs) and leveraging the knowledge-rich rationales generated by large language models (LLMs). It introduces LRSA, an encoder-decoder framework that injects LLM-generated rationales into SLMs and fuses them through a dual cross-attention module to better connect image regions and textual cues with sentiment. Empirical results on the Twitter2015 and Twitter2017 benchmarks show that LRSA consistently outperforms both text-only ABSA baselines and prior multimodal approaches across the MABSA, MATE, and MASC tasks, with ablation studies confirming the critical role of the dual cross-attention and rationale integration. The work demonstrates that incorporating targeted LLM rationales can substantially enhance fine-grained multimodal reasoning, suggesting practical paths for improving MABSA systems across diverse pre-trained backbones.
Abstract
There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect information related to aspects and sentiments from both image and text, with the aim of aligning these two modalities. However, SLMs possess limited capacity and knowledge, often resulting in inaccurate identification of meaning, aspects, sentiments, and their interconnections in textual and visual data. Large language models (LLMs), on the other hand, have shown exceptional capabilities in various tasks by effectively exploring fine-grained information in multimodal data. Nevertheless, some studies indicate that LLMs still fall short of fine-tuned small models in the field of ABSA. Based on these findings, we propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA. Specifically, we inject explanations generated by LLMs as rationales into SLMs and employ a dual cross-attention mechanism to enhance feature interaction and fusion, thereby augmenting the SLMs' ability to identify aspects and sentiments. We evaluate our method using two baseline models. Extensive experiments highlight the superiority of our approach on three widely used benchmarks, indicating its generalizability and applicability to most pre-trained models for MABSA.
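
The abstract does not spell out how the dual cross-attention mechanism is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes two feature sequences of equal width (here called text_feats and rationale_feats, both hypothetical names), lets each sequence attend over the other with plain scaled dot-product attention, and adds the result back as a residual. Learned projections, multiple heads, and the image branch are omitted as simplifying assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query row attends over keys_values.

    Assumption: keys and values are the same vectors (no learned projections).
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return attended


def dual_cross_attention(text_feats, rationale_feats):
    """Run attention in both directions and add a residual connection.

    Hypothetical fusion rule chosen for illustration only.
    """
    text_ctx = cross_attention(text_feats, rationale_feats)
    rationale_ctx = cross_attention(rationale_feats, text_feats)
    fused_text = [
        [a + b for a, b in zip(t, c)] for t, c in zip(text_feats, text_ctx)
    ]
    fused_rationale = [
        [a + b for a, b in zip(r, c)] for r, c in zip(rationale_feats, rationale_ctx)
    ]
    return fused_text, fused_rationale


# Toy inputs: two text tokens and one rationale token, each of width 2.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
rationale_feats = [[2.0, 3.0]]
fused_text, fused_rationale = dual_cross_attention(text_feats, rationale_feats)
```

In this toy setup each text token has only one rationale vector to attend to, so its context is exactly that vector. A practical version would typically use a tensor library and learned query, key, and value projections.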
