Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition
Daiqing Wu, Dongbao Yang, Huawen Shen, Can Ma, Yu Zhou
TL;DR
The paper addresses sentiment discrepancy in multimodal image-text posts by introducing CoDe, a network with semantics completion and semantics decomposition to jointly model cross-modal consistency and discrepancy. Semantics completion leverages in-image text to bridge image and text representations, while semantics decomposition splits modalities into shared and private sentiment components, guided by soft inter-modal contrastive learning. Fusion via cross-attention combines consistent sentiment with the learned discrepant sentiment, and the total objective combines classification, exclusive projection, and contrastive terms. Across four benchmark datasets, CoDe achieves state-of-the-art performance, with ablations confirming the contribution of each module and analyses indicating improved cross-modal alignment, robustness to varying in-image text quality, and generalization to aspect-based multimodal sentiment tasks.
Abstract
With the proliferation of social media posts in recent years, the need to detect sentiments in multimodal (image-text) content has grown rapidly. Since posts are user-generated, the image and text from the same post can express different or even contradictory sentiments, leading to potential sentiment discrepancy. However, existing works mainly adopt a single-branch fusion structure that primarily captures the consistent sentiment between image and text. Ignoring discrepant sentiment, or modeling it only implicitly, results in compromised unimodal encoding and limited performance. In this paper, we propose a semantics Completion and Decomposition (CoDe) network to resolve this issue. In the semantics completion module, we complement image and text representations with the semantics of the in-image text, helping bridge the sentiment gap. In the semantics decomposition module, we decompose image and text representations with exclusive projection and contrastive learning, thereby explicitly capturing the discrepant sentiment between modalities. Finally, we fuse image and text representations by cross-attention and combine them with the learned discrepant sentiment for final classification. Extensive experiments on four datasets demonstrate the superiority of CoDe and the effectiveness of each proposed module.
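The abstract does not say how exclusive projection is computed. As a rough illustration of the decomposition idea only, the sketch below assumes one common reading: a feature vector is split into a component along a shared direction and an orthogonal remainder treated as the private part. The names `decompose`, `dot`, `image_feature`, and `shared_direction` are hypothetical and do not come from the paper, and the actual CoDe module may differ substantially.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose(feature, shared_direction):
    """Split `feature` into shared and private parts.

    Assumption (not from the paper): the shared part is the orthogonal
    projection of `feature` onto `shared_direction`, and the private part
    is whatever remains, so the two parts are orthogonal by construction.
    """
    norm_sq = dot(shared_direction, shared_direction)
    if norm_sq == 0.0:
        raise ValueError("shared_direction must be non-zero")
    scale = dot(feature, shared_direction) / norm_sq
    shared = [scale * s for s in shared_direction]
    private = [f - c for f, c in zip(feature, shared)]
    return shared, private


# Toy example: a 3-dimensional "image" feature and a shared direction.
image_feature = [2.0, 1.0, 0.0]
shared_direction = [1.0, 0.0, 0.0]
shared, private = decompose(image_feature, shared_direction)
```

In this toy case the shared and private parts sum back to the original feature and have zero inner product, which is the property an exclusivity constraint of this kind would aim for. In a trained network both parts would presumably come from learned projections rather than a fixed direction.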
