RNG: Reducing Multi-level Noise and Multi-grained Semantic Gap for Joint Multimodal Aspect-Sentiment Analysis

Yaxin Liu, Yan Zhou, Ziming Li, Jinchuan Zhang, Yu Shang, Chenyang Zhang, Songlin Hu

TL;DR

This work addresses Joint Multimodal Aspect-Sentiment Analysis (JMASA) by introducing RNG, an information-theoretic framework that reduces both intra- and inter-modal noise while aligning coarse- and fine-grained semantics. It introduces three constraints—Global Relevance Constraint (GR-Con) for instance-level noise, Information Bottleneck Constraint (IB-Con) for feature-level noise, and Semantic Consistency Constraint (SC-Con) for mutual-information-guided cross-modal alignment via InfoNCE. The model employs RoBERTa and ViT encoders, variational attention to produce latent representations $Z^T$ and $Z^V$, Cross-GAU for inter-modal fusion, and a CRF for sequence labeling, trained with a composite loss $L = L_{task} + L_{IB} + L_{SC}$ including mutual-information terms. Experiments on Twitter-2015 and Twitter-2017 demonstrate state-of-the-art performance and validate the contribution of each constraint through ablation studies. Overall, the approach offers a principled, end-to-end method that leverages information theory to enhance fine-grained multimodal sentiment understanding and extraction.
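To make the InfoNCE-based alignment mentioned above more concrete, here is a minimal, dependency-free sketch of a batch-level InfoNCE loss. It assumes that the text and image features at the same index form the positive pair and that all other images in the batch act as negatives. The function names, the cosine-similarity scoring, and the `temperature` value are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_feats, image_feats, temperature=0.1):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    (text_feats[i], image_feats[i]) is treated as the positive pair;
    every other image in the batch serves as a negative.
    """
    n = len(text_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(text_feats[i], image_feats[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching image.
        total += log_denom - logits[i]
    return total / n


text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(text, aligned) < info_nce(text, shuffled))
```

Minimizing this loss maximizes a lower bound on the mutual information between the paired representations, since $I \geq \log N - L_{\text{InfoNCE}}$ for a batch of $N$ pairs. That bound is the usual justification for using a contrastive loss as a mutual-information objective.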

Abstract

Joint Multimodal Aspect-Sentiment Analysis (JMASA), an important multimodal sentiment analysis task that aims to jointly extract aspect terms and their associated sentiment polarities from given text-image pairs, has attracted increasing attention. Existing works face two limitations: (1) multi-level modality noise, i.e., instance- and feature-level noise; and (2) a multi-grained semantic gap, i.e., coarse- and fine-grained gaps. Both issues may interfere with the accurate identification of aspect-sentiment pairs. To address these limitations, we propose a novel framework named RNG for JMASA. Specifically, to simultaneously reduce the multi-level modality noise and the multi-grained semantic gap, we design three constraints: (1) a Global Relevance Constraint (GR-Con) based on text-image similarity for instance-level noise reduction, (2) an Information Bottleneck Constraint (IB-Con) based on the Information Bottleneck (IB) principle for feature-level noise reduction, and (3) a Semantic Consistency Constraint (SC-Con) based on mutual information maximization via contrastive learning for multi-grained semantic gap reduction. Extensive experiments on two datasets show that RNG achieves new state-of-the-art performance.
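For readers unfamiliar with the IB principle that IB-Con builds on, its textbook form is sketched below. Here $X$ denotes an input modality, $Z$ its latent representation, $Y$ the prediction target, and $\beta$ a trade-off weight. These symbols follow the standard IB formulation and are assumptions, not necessarily the exact objective or notation used in the paper.

$$\max_{Z} \; I(Z; Y) - \beta \, I(Z; X)$$

The first term rewards keeping information that is predictive of the target, while the second penalizes information retained from the raw input, which is the mechanism by which feature-level noise can be compressed away.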

Paper Structure

This paper contains 20 sections, 13 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: An example of JMASA.
  • Figure 2: An overview of our proposed RNG.
  • Figure 3: F1-score against different $\beta$ on two datasets (%).
  • Figure 4: Predictions of different methods on two test samples.