
Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition

Daiqing Wu, Dongbao Yang, Huawen Shen, Can Ma, Yu Zhou

TL;DR

The paper addresses sentiment discrepancy in multimodal image-text posts by introducing CoDe, a network with semantics completion and semantics decomposition to jointly model cross-modal consistency and discrepancy. Semantics completion leverages in-image text to bridge image and text representations, while semantics decomposition splits modalities into shared and private sentiment components, guided by soft inter-modal contrastive learning. Fusion via cross-attention combines consistent sentiment with the learned discrepant sentiment, and the total objective combines classification, exclusive projection, and contrastive terms. Across four benchmark datasets, CoDe achieves state-of-the-art performance, with ablations confirming the contribution of each module and analyses indicating improved cross-modal alignment, robustness to varying in-image text quality, and generalization to aspect-based multimodal sentiment tasks.

Abstract

With the proliferation of social media posts in recent years, the need to detect sentiments in multimodal (image-text) content has grown rapidly. Since posts are user-generated, the image and text from the same post can express different or even contradictory sentiments, leading to potential sentiment discrepancy. However, existing works mainly adopt a single-branch fusion structure that primarily captures the consistent sentiment between image and text. Ignoring discrepant sentiment, or modeling it only implicitly, compromises unimodal encoding and limits performance. In this paper, we propose a semantics Completion and Decomposition (CoDe) network to resolve this issue. In the semantics completion module, we complement image and text representations with the semantics of the in-image text, helping bridge the sentiment gap. In the semantics decomposition module, we decompose image and text representations with exclusive projection and contrastive learning, thereby explicitly capturing the discrepant sentiment between modalities. Finally, we fuse image and text representations by cross-attention and combine them with the learned discrepant sentiment for final classification. Extensive experiments on four datasets demonstrate the superiority of CoDe and the effectiveness of each proposed module.
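To make the idea of separating a representation into shared and private parts more concrete, the following is a minimal sketch of one generic way to keep a private component free of shared information: plain orthogonal projection. This is an assumption for illustration only. The abstract does not define exclusive projection, and the function names `dot` and `remove_shared_direction` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_shared_direction(private, shared):
    """Subtract from `private` its component along `shared`.

    Assumption: orthogonal projection stands in for the paper's
    exclusive projection, whose exact form is not given here.
    `shared` must be a non-zero vector.
    """
    scale = dot(private, shared) / dot(shared, shared)
    return [p - scale * s for p, s in zip(private, shared)]


# Toy example: the shared direction is the first axis.
shared = [1.0, 0.0]
private = [2.0, 3.0]
exclusive = remove_shared_direction(private, shared)
print(exclusive)               # [0.0, 3.0]
print(dot(exclusive, shared))  # 0.0
```

After the projection, the private vector has no component along the shared direction, which is the property a decomposition of this kind is meant to enforce.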

Paper Structure

This paper contains 25 sections, 14 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Posts from social media.
  • Figure 2: Framework comparison of single-branch fusion and our method.
  • Figure 3: Pipeline of CoDe. After encoding, two attention modules complement the image and text representations with in-image text semantics. Each complemented representation is then decomposed into two components, one learning the modality-shared sentiment and the other learning the modality-private sentiment. Finally, the discrepant and consistent sentiments are explicitly modeled for classification.
  • Figure 4: An example of soft inter-modal contrastive learning. Solid inward-pointing arrows and black blocks indicate representations that are brought closer. Dashed inward-pointing arrows and grey blocks indicate pairs that are weakly brought closer. Solid outward-pointing arrows and light grey blocks indicate negative pairs that are pushed apart. Each $s_{i}^{v}$ is brought closer to its counterpart $s_{i}^{t}$. The pair $(s_{1}^{v},s_{3}^{t})$, with $w_{1,3}=0$, is treated as a negative pair and pushed away. The pair $(s_{1}^{v},s_{2}^{t})$, with $w_{1,2}=0.5$, is treated as a partial positive pair and weakly brought closer. The pair $(s_{3}^{v},s_{4}^{t})$, with $w_{3,4}=1$, is treated as a positive pair and brought closer.
  • Figure 5: Attention heatmaps of the image encoders of Att and CoDe for posts containing sentiment discrepancy. The white bounding boxes mark the foreground objects through which humans intuitively perceive visual sentiments.
  • ...and 2 more figures
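
The weighting scheme described in the Figure 4 caption can be sketched as a contrastive loss with soft targets. The code below is a hedged illustration, not the paper's formulation: it assumes cosine similarity, a temperature `tau`, row-normalized weights used as soft targets, and only the image-to-text direction. How the weights $w_{i,j}$ are computed is not stated in this excerpt, so they are passed in as given; the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def soft_contrastive_loss(s_v, s_t, w, tau=0.1):
    """Image-to-text contrastive loss with soft pair weights.

    s_v[i], s_t[j]: shared-sentiment vectors of image i and text j.
    w[i][j] in [0, 1]: 1 = positive, 0 = negative, between = partial
    positive. Assumes w[i][i] = 1 so every row sum is non-zero.
    """
    n = len(s_v)
    total = 0.0
    for i in range(n):
        logits = [cosine(s_v[i], s_t[j]) / tau for j in range(n)]
        # Numerically stable log-sum-exp over all texts in the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        row_sum = sum(w[i])
        # Cross-entropy against the normalized weights: pairs with
        # w = 0 appear only in log_z, so they are pushed apart.
        total -= sum((w[i][j] / row_sum) * (logits[j] - log_z)
                     for j in range(n))
    return total / n


# Toy batch of two posts with perfectly aligned shared vectors.
vecs = [[1.0, 0.0], [0.0, 1.0]]
hard = [[1.0, 0.0], [0.0, 1.0]]   # only matching pairs are positive
soft = [[1.0, 0.5], [0.5, 1.0]]   # cross pairs are partial positives
print(soft_contrastive_loss(vecs, vecs, hard))
print(soft_contrastive_loss(vecs, vecs, soft))
```

With hard weights this reduces to a standard InfoNCE-style loss. A symmetric variant would add the text-to-image direction and average the two terms.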