Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum

Tao Meng, Fuchen Zhang, Yuntao Shou, Wei Ai, Nan Yin, Keqin Li

TL;DR

The paper addresses MERC by identifying the limitations of traditional GNNs in capturing long-distance dependencies and high-frequency signals within multimodal conversations. It introduces GS-MCC, a graph-spectrum framework that constructs a sliding-window multimodal interaction graph and uses efficient Fourier graph operators to extract long-distance low- and high-frequency information, representing consistency and complementarity respectively. A frequency-domain contrastive learning strategy (LFCL and HFCL) promotes collaboration between these frequency bands, and a lightweight classifier fuses the frequency-aware embeddings for emotion prediction. Extensive experiments on IEMOCAP and MELD show that GS-MCC achieves state-of-the-art performance with significantly fewer parameters (about 2.10M) and demonstrates improved convergence and reduced over-smoothing compared with baseline GNNs. The results validate the effectiveness and efficiency of learning through graph-spectrum perspectives for MERC.

Abstract

Efficiently capturing consistent and complementary semantic features in a multimodal conversation context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model the semantic dependencies of the dialogue context and employ Graph Neural Networks (GNNs) to capture multimodal semantic features for emotion recognition. However, these methods are limited by inherent characteristics of GNNs, such as over-smoothing and low-pass filtering, and therefore cannot efficiently learn long-distance consistency and complementary information. Since consistency and complementarity information correspond to low-frequency and high-frequency information, respectively, this paper revisits MERC from the perspective of the graph spectrum. Specifically, we propose GS-MCC, a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework. First, GS-MCC uses a sliding window to construct a multimodal interaction graph that models conversational relationships, and uses efficient Fourier graph operators to extract long-distance high-frequency and low-frequency information. Then, GS-MCC uses contrastive learning to construct self-supervised signals that reflect complementary and consistent semantic collaboration between the high- and low-frequency signals, thereby improving the ability of both to reflect real emotions. Finally, GS-MCC feeds the collaborative high- and low-frequency information into an MLP and a softmax function for emotion prediction. Extensive experiments on two benchmark datasets demonstrate the superiority of the proposed GS-MCC architecture.
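
The link the abstract draws between low-pass filtering, over-smoothing, and the loss of high-frequency information can be illustrated with a small textbook graph-Fourier sketch. This is not the paper's Fourier graph operator or its multimodal graph: it assumes a single-modality chain of utterances, a hypothetical window size, a spectral cutoff at the midpoint of the normalized-Laplacian spectrum, and illustrative function names (`sliding_window_adjacency`, `split_frequencies`, `low_pass_propagate`) that do not come from the source.

```python
import numpy as np


def sliding_window_adjacency(num_utterances: int, window: int) -> np.ndarray:
    """Connect each utterance to every other utterance within `window` turns.

    Assumption: one node per utterance, no self-loops, undirected edges.
    """
    idx = np.arange(num_utterances)
    dist = np.abs(idx[:, None] - idx[None, :])
    return ((dist > 0) & (dist <= window)).astype(float)


def normalized_laplacian(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    return np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]


def split_frequencies(lap: np.ndarray, x: np.ndarray, cutoff: float = 1.0):
    """Split node features into low- and high-frequency parts.

    Eigenvalues of the normalized Laplacian lie in [0, 2]; components with
    eigenvalue below `cutoff` are treated as low frequency (an assumption
    made for illustration only).
    """
    eigvals, eigvecs = np.linalg.eigh(lap)
    coeffs = eigvecs.T @ x  # graph Fourier transform of the features
    low_mask = eigvals < cutoff
    low = eigvecs[:, low_mask] @ coeffs[low_mask]
    high = eigvecs[:, ~low_mask] @ coeffs[~low_mask]
    return low, high


def low_pass_propagate(lap: np.ndarray, x: np.ndarray, steps: int) -> np.ndarray:
    """Apply the low-pass filter (I - L/2) repeatedly, GCN-style.

    Each step scales a component with eigenvalue lam by (1 - lam / 2),
    so high-frequency components decay fastest.
    """
    prop = np.eye(len(lap)) - 0.5 * lap
    for _ in range(steps):
        x = prop @ x
    return x


if __name__ == "__main__":
    adj = sliding_window_adjacency(num_utterances=6, window=2)
    lap = normalized_laplacian(adj)
    features = np.random.default_rng(0).normal(size=(6, 4))

    low, high = split_frequencies(lap, features)
    smoothed = low_pass_propagate(lap, features, steps=8)
    _, high_after = split_frequencies(lap, smoothed)

    print("high-frequency norm before:", np.linalg.norm(high))
    print("high-frequency norm after 8 steps:", np.linalg.norm(high_after))
```

Under these assumptions, stacking more propagation steps leaves almost none of the high-frequency (complementary) component, which is the behavior a framework that handles both frequency bands explicitly is meant to avoid.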

Paper Structure (17 sections, 22 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: An example of a multimodal conversation from the MELD dataset. MERC aims to identify each utterance's emotion label (e.g., Neutral, Surprise, Joy).
  • Figure 2: The overall architecture of the proposed GS-MCC model. Multimodal utterances and speaker information are first embedded, and the embedded features are used to construct a multimodal semantic interaction graph. A Fourier graph neural network then captures long-distance high- and low-frequency information, and contrastive learning finally makes the high- and low-frequency information collaborate for emotion recognition.
  • Figure 3: Loss trends during model training and inference on the IEMOCAP and MELD datasets. We compare DialogueGCN, GS-MCC without contrastive loss and GS-MCC.
  • Figure 4: Emotion recognition performance of DialogueGCN and GS-MCC on the IEMOCAP and MELD datasets. We stack 4 and 8 GCN layers to explore the over-smoothing phenomenon of the model.