PaSE: Prototype-aligned Calibration and Shapley-based Equilibrium for Multimodal Sentiment Analysis

Kang He, Boyu Chen, Yuzhe Ding, Fei Li, Chong Teng, Donghong Ji

TL;DR

PaSE addresses modality competition in multimodal sentiment analysis by jointly applying prototype-aligned calibration and Shapley-based equilibrium to balance cross-modal contributions. It combines intra-modal prototype refinement, entropic optimal transport-based cross-modal alignment, and a dual-phase optimization that first fuses modalities adaptively and then reweights gradients to mitigate dominance. The key contributions are (i) Prototype-guided Calibration Learning, (ii) Entropic Optimal Transport-based cross-modal alignment, (iii) Prototype-Gated Fusion with context-aware gating, and (iv) Shapley-based Gradient Modulation for balanced optimization, yielding state-of-the-art results on MOSI, MOSEI, and IEMOCAP. The approach offers improved robustness, interpretability, and balanced multimodal representations, with practical relevance for dialogue systems and social signal processing.
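To make the entropic optimal transport step more concrete, here is a minimal sketch of the standard Sinkhorn iteration, which is the usual way such problems are solved. It is not taken from the paper: the function name `sinkhorn`, the toy cost matrix, the marginals, and the values of `eps` and `n_iters` are all assumptions chosen for illustration. In a PaSE-like setting, the cost matrix might hold distances between prototypes of two modalities, but that mapping is also an assumption.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Approximate the entropic OT plan between histograms a and b.

    cost[i][j] is the cost of moving mass from source bin i to target bin j.
    eps is the entropic regularization strength (hypothetical default).
    """
    n, m = len(a), len(b)
    # Gibbs kernel: K_ij = exp(-C_ij / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so the plan's row sums match a
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan's column sums match b
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan: P_ij = u_i * K_ij * v_j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example (made-up numbers): two source bins, two target bins.
cost = [[0.0, 1.0], [1.0, 0.0]]
a = [0.5, 0.5]
b = [0.25, 0.75]
plan = sinkhorn(cost, a, b)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(len(a))) for j in range(len(b))]
```

A smaller `eps` pushes the plan toward the unregularized optimum but makes the kernel entries tiny, so practical implementations often work in the log domain; this sketch omits that for brevity.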

Abstract

Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by integrating textual, acoustic, and visual signals. Although multimodal fusion is designed to leverage cross-modal complementarity, real-world scenarios often exhibit modality competition: dominant modalities tend to overshadow weaker ones, leading to suboptimal performance. In this paper, we propose PaSE, a novel Prototype-aligned Calibration and Shapley-optimized Equilibrium framework, which enhances collaboration while explicitly mitigating modality competition. PaSE first applies Prototype-guided Calibration Learning (PCL) to refine unimodal representations and align them through an Entropic Optimal Transport mechanism that ensures semantic consistency. To further stabilize optimization, we introduce a Dual-Phase Optimization strategy. A prototype-gated fusion module is first used to extract shared representations, followed by Shapley-based Gradient Modulation (SGM), which adaptively adjusts gradients according to the contribution of each modality. Extensive experiments on IEMOCAP, MOSI, and MOSEI confirm that PaSE achieves superior performance and effectively alleviates modality competition.
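The idea of measuring "the contribution of each modality" can be illustrated with exact Shapley values over the three modalities. The sketch below is a generic Shapley computation, not the paper's SGM: the function name `shapley_values`, the modality labels, and the scores in `toy_scores` are hypothetical, made-up numbers standing in for the performance of a model restricted to each subset of modalities. How PaSE turns such contributions into gradient scaling factors is not specified in this summary and is not modeled here.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal gains over all orderings.

    players: list of player names (here, modalities).
    value: function mapping a frozenset of players to a real-valued score.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal gain of adding p to the players that came before it
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical scores for every modality subset (illustrative only).
toy_scores = {
    frozenset(): 0.50,
    frozenset({"text"}): 0.80,
    frozenset({"audio"}): 0.60,
    frozenset({"visual"}): 0.58,
    frozenset({"text", "audio"}): 0.83,
    frozenset({"text", "visual"}): 0.82,
    frozenset({"audio", "visual"}): 0.64,
    frozenset({"text", "audio", "visual"}): 0.85,
}

modalities = ["text", "audio", "visual"]
phi = shapley_values(modalities, lambda s: toy_scores[s])
```

With three modalities there are only six orderings, so exact enumeration is cheap. The values sum to the gap between the full-model score and the empty-set score, which makes them convenient as a normalized measure of which modality dominates.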

Paper Structure

This paper contains 37 sections, 21 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Performance improvements (F1 score) from adding audio (+A), visual (+V), or both modalities (+VA) to text-only baselines on the CMU-MOSI dataset.
  • Figure 2: The overall architecture of our proposed model PaSE.
  • Figure 3: t-SNE visualization of fused feature representations obtained using different models on the MOSI test set. (a) SelfMM, (b) EUAR, and (c) PaSE (ours).
  • Figure 4: Heatmap of modality contributions for four emotion categories on IEMOCAP. Text, Visual, and Audio modalities show varying influence, with Text generally dominating, especially in Neutral and Sadness.
  • Figure 5: The influence of SGM on different modalities and the overall performance. The dual y-axis shows the average contribution of each modality (left) and the corresponding Acc-2 score (right).
  • ...and 1 more figure