Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis

Luwei Xiao, Rui Mao, Shuai Zhao, Qika Lin, Yanhao Jia, Liang He, Erik Cambria

TL;DR

This work tackles multimodal aspect-based sentiment classification (MASC) by introducing Chimera, a framework that models cognitive and aesthetic causalities in image-text pairs. It combines patch-token level alignment (via dynamic patch selection and semantic patch calibration), a translation module to generate emotion-laden textual cues, and a rationale-aware learning objective that jointly predicts sentiment and generates semantic and impression rationales using LLM-derived data. Empirical results on Twitter benchmarks show state-of-the-art performance and strong robustness to domain shifts, with ablations confirming the importance of rationale reasoning, linguistic-visual alignment, and object-level descriptions. The approach enhances interpretability and effectiveness in multimodal sentiment analysis and is publicly released to support further research and practical deployment.

Abstract

Multimodal aspect-based sentiment classification (MASC) is an emerging task, driven by the growth of user-generated multimodal content on social platforms, that aims to predict sentiment polarity toward specific aspect targets (i.e., entities or attributes explicitly mentioned in text-image pairs). Despite extensive efforts and significant achievements in existing MASC research, substantial gaps remain in understanding fine-grained visual content and the cognitive rationales derived from semantic content and impressions (cognitive interpretations of emotions evoked by image content). In this study, we present Chimera, a cognitive and aesthetic sentiment causality understanding framework that derives fine-grained holistic features of aspects and infers the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance (the synergistic effect between emotional responses and cognitive interpretations). Specifically, the framework first incorporates visual patch features for patch-word alignment. Meanwhile, it extracts coarse-grained visual features (e.g., overall image representation) and fine-grained visual regions (e.g., aspect-related regions) and translates them into corresponding textual descriptions (e.g., facial, aesthetic). Finally, we leverage the sentimental causes and impressions generated by a large language model (LLM) to enhance the model's awareness of sentimental cues evoked by semantic content and affective-cognitive resonance. Experimental results on standard MASC datasets demonstrate the effectiveness of the proposed model, which also adapts more flexibly to MASC than LLMs such as GPT-4o. We have publicly released the complete implementation and dataset at https://github.com/Xillv/Chimera
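
To make the idea of patch-word alignment more concrete, the following is a minimal sketch of one simple way to score image patches against text tokens and keep the best-matching ones. It is not the Chimera implementation: the function names `cosine` and `select_patches`, the max-over-tokens scoring rule, the fixed `k`, and the toy 2-D feature values are all assumptions made for illustration. The abstract states only that visual patch features are used for patch-word alignment.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_patches(
    patches: Sequence[Sequence[float]],
    tokens: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k patches most similar to any token, best first.

    Assumed scoring rule (hypothetical): a patch's score is its highest
    cosine similarity to any token. `tokens` must be non-empty.
    """
    scores = [max(cosine(p, t) for t in tokens) for p in patches]
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-D features (hypothetical values, for illustration only).
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tokens = [[0.9, 0.1], [0.8, 0.3]]
top = select_patches(patches, tokens, k=2)
```

In this toy case the first patch points in nearly the same direction as both tokens, so it ranks first, while the third patch points the opposite way and is dropped. A real system would presumably operate on learned encoder features and might choose the number of retained patches dynamically rather than fixing `k`.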

Paper Structure

This paper contains 31 sections, 24 equations, 6 figures, 7 tables, 2 algorithms.

Figures (6)

  • Figure 1: The overall framework of the proposed Chimera. Chimera consists of four parts: Translation Module, Rationale Dataset Construction, Linguistic-aware Semantic Alignment, and Rationale-Aware Learning.
  • Figure 2: Results ($\%$) for different values of the hyper-parameters $\alpha$ and $\lambda$.
  • Figure 3: Human evaluation of factuality, clarity and fluency for SR and IR.
  • Figure 4: Assessment of sentiment intensity for SR and IR in both ground truth data and Chimera-generated content.
  • Figure 5: Visualization of the top 15 most frequent aesthetic-related words in generated IR.
  • ...and 1 more figure