SPP-SCL: Semi-Push-Pull Supervised Contrastive Learning for Image-Text Sentiment Analysis and Beyond

Jiesheng Wu, Shengrong Li

TL;DR

Experiments on three public image-text sentiment and sarcasm detection datasets show that SPP-SCL outperforms state-of-the-art methods by a large margin and is more discriminative with respect to sentiment.

Abstract

Existing Image-Text Sentiment Analysis (ITSA) methods may suffer from inconsistent intra-modal and inter-modal sentiment relationships. To address this vision-language imbalance, we propose Semi-Push-Pull Supervised Contrastive Learning (SPP-SCL), a method that balances these relationships before fusing the modalities. SPP-SCL follows a novel two-step strategy. The first step applies the proposed intra-modal supervised contrastive learning to pull together the sentiment relationships within each modality, and then evaluates a well-designed conditional execution statement. If the statement evaluates to false, the second step applies inter-modal supervised contrastive learning to push apart the relationships between modalities. Together, the two steps balance the intra-modal and inter-modal relationships so that they are consistent, after which cross-modal feature fusion is performed for sentiment analysis and detection. Experiments on three public image-text sentiment and sarcasm detection datasets show that SPP-SCL outperforms state-of-the-art methods by a large margin and is more discriminative with respect to sentiment.
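
The two-step control flow described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the loss is a generic SupCon-style supervised contrastive loss, the temperature `tau` is a placeholder value, and `is_balanced` is a hypothetical stand-in for the paper's conditional execution statement, whose exact form the abstract does not give.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def sup_con_loss(anchors, candidates, a_labels, c_labels, tau=0.1, skip_self=False):
    """Generic SupCon-style loss of each anchor against all candidates.

    Candidates sharing the anchor's label are positives. With skip_self=True,
    candidate i is excluded for anchor i (used when both sets are the same).
    """
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total, counted = 0.0, 0
    for i, a in enumerate(anchors):
        idx = [j for j in range(len(candidates)) if not (skip_self and j == i)]
        logits = {j: sum(x * y for x, y in zip(a, candidates[j])) / tau for j in idx}
        positives = [j for j in idx if c_labels[j] == a_labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


def spp_scl_step(img, txt, labels, is_balanced, tau=0.1):
    """Step 1: intra-modal losses. Step 2 (conditional): inter-modal loss.

    is_balanced is a HYPOTHETICAL callable standing in for the paper's
    conditional statement; step 2 runs only when it returns False.
    """
    loss = sup_con_loss(img, img, labels, labels, tau, skip_self=True)
    loss += sup_con_loss(txt, txt, labels, labels, tau, skip_self=True)
    ran_inter = False
    if not is_balanced(img, txt, labels):
        loss += sup_con_loss(img, txt, labels, labels, tau)
        ran_inter = True
    return loss, ran_inter
```

Calling `spp_scl_step(img, txt, labels, is_balanced=lambda *_: False)` forces the inter-modal term, while a callable that returns `True` skips it. How the three terms are weighted in the actual method is not stated in the abstract, so the sketch simply sums them.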

Paper Structure

This paper contains 27 sections, 7 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of intra-modal and inter-modal sentiment distances under different training strategies. (a) Cross-entropy loss only: image and text sentiment representations are scattered, leading to inconsistent intra- and inter-modal sentiment relationships. (b) Intra-modal contrastive learning aligns IID and TTD, but fails on ITD. (c) Inter-modal contrastive learning aligns ITD but ignores intra-modal structure. (d) Our proposed SPP-SCL balances all three sentiment distances, yielding consistent sentiment embeddings.
  • Figure 2: Overall architecture of SPP-SCL. The framework includes two main steps: intra-modal sentiment alignment via supervised contrastive learning ($\mathcal{L}_{cl_{i}}$ and $\mathcal{L}_{cl_{t}}$), and conditional inter-modal sentiment alignment ($\mathcal{L}_{cl_{m}}$).
  • Figure 3: Visualization of the fusion feature distribution on the three datasets.
  • Figure 4: Visualization of the sentiment distance distribution.
  • Figure 5: Sensitivity of hyperparameter $\alpha$.
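
To make the three sentiment distances named in the Figure 1 caption concrete, here is a minimal sketch of how such quantities could be measured on a labeled batch. It rests on assumptions not confirmed by the text above: IID, TTD, and ITD are read as image-image, text-text, and image-text distances, each is taken as the mean cosine distance over same-label pairs, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def mean_same_label_distance(xs, ys, labels, skip_diagonal):
    """Mean cosine distance over pairs (xs[i], ys[j]) that share a label."""
    dists = [
        cosine_distance(xs[i], ys[j])
        for i in range(len(xs))
        for j in range(len(ys))
        if labels[i] == labels[j] and not (skip_diagonal and i == j)
    ]
    return sum(dists) / len(dists) if dists else 0.0


def sentiment_distances(img, txt, labels):
    """Assumed readings: IID = image-image, TTD = text-text, ITD = image-text."""
    return {
        "IID": mean_same_label_distance(img, img, labels, skip_diagonal=True),
        "TTD": mean_same_label_distance(txt, txt, labels, skip_diagonal=True),
        "ITD": mean_same_label_distance(img, txt, labels, skip_diagonal=False),
    }
```

Under this reading, a batch whose image and text embeddings each cluster by label but point in different directions across modalities would show small IID and TTD alongside a large ITD, which corresponds to the situation the caption describes for panel (b).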