Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum
Tao Meng, Fuchen Zhang, Yuntao Shou, Wei Ai, Nan Yin, Keqin Li
TL;DR
The paper addresses Multimodal Emotion Recognition in Conversation (MERC) and argues that traditional GNNs struggle to capture long-distance dependencies and high-frequency signals in multimodal conversations. It introduces GS-MCC, a graph-spectrum framework that constructs a sliding-window multimodal interaction graph and uses efficient Fourier graph operators to extract long-distance low-frequency and high-frequency information, which represent consistency and complementarity, respectively. A frequency-domain contrastive learning strategy (LFCL and HFCL) promotes collaboration between the two frequency bands, and a lightweight classifier fuses the frequency-aware embeddings for emotion prediction. Experiments on IEMOCAP and MELD show that GS-MCC achieves state-of-the-art performance with significantly fewer parameters (about 2.10M), and that it converges better and suffers less from over-smoothing than baseline GNNs. The results support the effectiveness and efficiency of approaching MERC from a graph-spectrum perspective.
Abstract
Efficiently capturing consistent and complementary semantic features in a multimodal conversation context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model the semantic dependencies of the dialogue context and employ Graph Neural Networks (GNNs) to capture multimodal semantic features for emotion recognition. However, these methods are limited by inherent characteristics of GNNs, such as over-smoothing and low-pass filtering, which prevent them from efficiently learning long-distance consistency and complementarity information. Since consistency and complementarity information correspond to low-frequency and high-frequency information, respectively, this paper revisits multimodal emotion recognition in conversation from the perspective of the graph spectrum. Specifically, we propose a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework, GS-MCC. First, GS-MCC uses a sliding window to construct a multimodal interaction graph that models conversational relationships, and uses efficient Fourier graph operators to extract long-distance high-frequency and low-frequency information. Then, GS-MCC uses contrastive learning to construct self-supervised signals that reflect complementary and consistent semantic collaboration between the high-frequency and low-frequency signals, thereby improving the ability of both to reflect real emotions. Finally, GS-MCC feeds the collaborative high-frequency and low-frequency information into an MLP followed by a softmax function for emotion prediction. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, demonstrate the superiority of the proposed GS-MCC architecture.
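
To make the low-frequency versus high-frequency distinction concrete, the following is a minimal sketch and not the GS-MCC implementation. It assumes a single-modality chain of utterances, builds a sliding-window adjacency matrix, and splits node features by an explicit eigendecomposition of the normalized graph Laplacian. The eigendecomposition is a simple stand-in for the paper's efficient Fourier graph operators, and the function names, the `window` parameter, and the `cutoff` value are all hypothetical choices made for illustration.

```python
import numpy as np


def sliding_window_adjacency(num_utterances, window):
    """Link each utterance to neighbours at most `window` positions away.

    Hypothetical construction: the source only states that a sliding
    window is used to build the interaction graph.
    """
    idx = np.arange(num_utterances)
    dist = np.abs(idx[:, None] - idx[None, :])
    return ((dist > 0) & (dist <= window)).astype(float)


def split_frequencies(adj, features, cutoff=1.0):
    """Split node features into low- and high-frequency components.

    Uses the spectrum of the symmetric normalized Laplacian, whose
    eigenvalues lie in [0, 2]. `cutoff` is an assumed threshold.
    """
    deg = adj.sum(axis=1)
    # Guard against isolated nodes when computing D^{-1/2}.
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

    # Eigenvectors form the graph Fourier basis.
    eigvals, eigvecs = np.linalg.eigh(lap)
    coeffs = eigvecs.T @ features  # graph Fourier transform

    low_mask = eigvals < cutoff
    low = eigvecs[:, low_mask] @ coeffs[low_mask]      # smooth, shared signal
    high = eigvecs[:, ~low_mask] @ coeffs[~low_mask]   # rapidly varying signal
    return low, high
```

Because the eigenvectors are orthonormal, the two components sum back to the original features, so the split loses no information. In this simplified picture, the low-frequency part varies slowly across neighbouring utterances (consistency) and the high-frequency part captures differences between neighbours (complementarity).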
