TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis

Weize Quan, Yunfei Feng, Ming Zhou, Yunzhen Zhao, Tong Wang, Dong-Ming Yan

TL;DR

The paper tackles multimodal sentiment analysis on unaligned sequences by proposing TCAN, a text-oriented cross-attention network that treats text as the dominant modality. It introduces two bi-modal fusion streams (text-visual and text-acoustic), a gated cross-attention mechanism to suppress noise and redundant features, and a unimodal joint-learning branch with a shared-weight encoder that encourages homogeneous sentiment features across modalities. On CMU-MOSI and CMU-MOSEI, the authors report that TCAN outperforms prior state-of-the-art methods, which supports both text-driven fusion and gating against noise. The approach offers a framework for integrating heterogeneous modalities whose alignment and signal quality vary.

Abstract

Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities. Despite the remarkable performance exhibited by previous MSA approaches, the presence of inherent multimodal heterogeneities poses a challenge, with the contribution of different modalities varying considerably. Past research predominantly focused on improving representation learning techniques and feature fusion strategies. However, many of these efforts overlooked the variation in semantic richness among different modalities, treating each modality uniformly. This approach may lead to underestimating the significance of strong modalities while overemphasizing the importance of weak ones. Motivated by these insights, we introduce a Text-oriented Cross-Attention Network (TCAN), emphasizing the predominant role of the text modality in MSA. Specifically, for each multimodal sample, by taking unaligned sequences of the three modalities as inputs, we initially allocate the extracted unimodal features into a visual-text and an acoustic-text pair. Subsequently, we implement self-attention on the text modality and apply text-queried cross-attention to the visual and acoustic modalities. To mitigate the influence of noise signals and redundant features, we incorporate a gated control mechanism into the framework. Additionally, we introduce unimodal joint learning to gain a deeper understanding of homogeneous emotional tendencies across diverse modalities through backpropagation. Experimental results demonstrate that TCAN consistently outperforms state-of-the-art MSA methods on two datasets (CMU-MOSI and CMU-MOSEI).
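
As a rough illustration of the text-queried cross-attention described in the abstract, the sketch below applies single-head scaled dot-product attention with queries taken from text and keys and values taken from a second modality. It is not the authors' implementation: the function name, the random projection matrices `w_q`, `w_k`, `w_v`, the feature width `d`, and the sequence lengths are all assumptions chosen for the example. What it shows is that the two sequences do not need to be aligned, because the attention matrix has one row per text token and one column per visual frame.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_queried_cross_attention(text, other, w_q, w_k, w_v):
    """Single-head attention with queries from text, keys/values from `other`.

    Hypothetical helper for illustration; the projections are plain matrices.
    """
    q = text @ w_q                            # (T_text, d)
    k = other @ w_k                           # (T_other, d)
    v = other @ w_v                           # (T_other, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_text, T_other)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights               # output has the text length


rng = np.random.default_rng(0)
d = 8                                  # assumed shared feature width
text = rng.normal(size=(5, d))         # 5 text tokens
visual = rng.normal(size=(12, d))      # 12 visual frames, not aligned to tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

attended, weights = text_queried_cross_attention(text, visual, w_q, w_k, w_v)
print(attended.shape, weights.shape)   # (5, 8) (5, 12)
```

The same call with acoustic features in place of `visual` would give the second bi-modal stream; a multi-head variant would split `d` across heads before the dot product.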

Paper Structure (17 sections, 14 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: (a) illustrates the significant discrepancies in sentiment analysis performance when each modality is used alone, adapted from MulT (Tsai et al., 2019). (b) shows an example of unimodal labels and multimodal labels, where the green dotted lines represent the process of backpropagation.
  • Figure 2: The framework of TCAN. Given the input multimodal data, TCAN encodes their respective shallow features $F_{m}$, where $m \in \{t, v, a\}$. In the Text-oriented Cross-attention module, TCAN uses cross-attention and self-attention to process text-audio and text-video pairs and applies a gated mechanism to control the impact of noise and redundant information (Section 3.3). At the same time, a shared-weight encoder, called the Homogeneous encoder, extracts homogeneous features from each modality for joint training (Section 3.4). Finally, the final representation is computed from the last-layer outputs of the Text-oriented Cross-attention module, which are concatenated for MSA.
  • Figure 3: The architecture of the Text-oriented Cross-attention module. The cross-attention block takes text-audio and text-video pairs as its input, and the self-attention block takes text as its input. After the cross-attention block, two gates are introduced to filter out adverse information. The memory gate decides what proportion of the target modality's components is passed forward, and the fuse gate decides what proportion of the fused components is injected into the target modality.
  • Figure 4: Performance of TCAN on CMU-MOSI for different values of the hyperparameter $N$, reported as $F_{1}$ scores.
  • Figure 5: Some representative examples for the visualization analysis of MSA. Red represents positive sentiment and blue represents negative sentiment.
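
The memory gate and fuse gate in the Figure 3 caption can be sketched as two sigmoid gates that rescale the target modality's features and the fused (cross-attention) features before they are summed. The caption does not give the gate equations, so everything specific here is an assumption made for illustration: the function `gated_update`, the weight matrices `w_mem` and `w_fuse`, computing both gates from the concatenation of the two inputs, the element-wise additive combination, and the requirement that `target` and `fused` already share a shape.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_update(target, fused, w_mem, w_fuse):
    """Combine target and fused features through two gates (assumed form).

    memory_gate scales how much of `target` is passed forward;
    fuse_gate scales how much of `fused` is injected.
    """
    joint = np.concatenate([target, fused], axis=-1)   # (T, 2d)
    memory_gate = sigmoid(joint @ w_mem)               # (T, d), values in (0, 1)
    fuse_gate = sigmoid(joint @ w_fuse)                # (T, d), values in (0, 1)
    out = memory_gate * target + fuse_gate * fused     # element-wise mix
    return out, memory_gate, fuse_gate


rng = np.random.default_rng(0)
T, d = 5, 8                                 # assumed sequence length and width
target = rng.normal(size=(T, d))            # features of the target modality
fused = rng.normal(size=(T, d))             # output of the cross-attention block
w_mem = 0.1 * rng.normal(size=(2 * d, d))   # hypothetical gate weights
w_fuse = 0.1 * rng.normal(size=(2 * d, d))

out, memory_gate, fuse_gate = gated_update(target, fused, w_mem, w_fuse)
print(out.shape)   # (5, 8)
```

A gate value near 0 suppresses the corresponding feature and a value near 1 passes it through unchanged, which is how gating of this kind can damp noisy or redundant components.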