DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition

Wei Ai; Yuntao Shou; Tao Meng; Nan Yin; Keqin Li

TL;DR

DER-GCN, a Dialogue and Event Relation-Aware Graph Convolutional Neural Network for multimodal emotion recognition, models dialogue relations between speakers, captures latent event relations, and introduces a Self-Supervised Masked Graph Autoencoder to improve the fused representation of features and structure.

Abstract

With the continuing development of deep learning (DL), multimodal dialogue emotion recognition (MDER) has recently received extensive research attention. MDER aims to identify the emotional information carried by different modalities, e.g., text, video, and audio, across different dialogue scenes. However, existing research has focused on modeling contextual semantic information and dialogue relations between speakers while ignoring the impact of event relations on emotion. To address this gap, we propose a novel Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition (DER-GCN), which models dialogue relations between speakers and captures latent event relations. Specifically, we construct a weighted multi-relationship graph that simultaneously captures the dependencies between speakers and the event relations in a dialogue. We also introduce a Self-Supervised Masked Graph Autoencoder (SMGAE) to improve the fused representation of features and structure. Next, we design a new Multiple Information Transformer (MIT) that captures the correlation between different relations and thereby better fuses the multivariate information across relations. Finally, we propose a loss optimization strategy based on contrastive learning to strengthen representation learning for minority-class features. Extensive experiments on the IEMOCAP and MELD benchmark datasets verify the effectiveness of DER-GCN. The results demonstrate that our model significantly improves both the average accuracy and the F1 score of emotion recognition.
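As a rough illustration of the weighted multi-relationship graph mentioned in the abstract, the sketch below splits typed edges into one weighted edge set per relation. It is not the authors' implementation: the function names, the relation labels "dialogue" and "event", the toy feature vectors, and the choice of cosine similarity as the edge weight are all assumptions made for this example.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_relation_graphs(features, edges):
    """Split typed edges into one weighted edge dict per relation.

    features: list of fused utterance vectors, one per node.
    edges: iterable of (src, dst, relation) triples.
    Returns {relation: {(src, dst): weight}}.
    """
    graphs = defaultdict(dict)
    for src, dst, relation in edges:
        # Assumption: the edge weight is the cosine similarity of the
        # endpoint features; the paper may define weights differently.
        graphs[relation][(src, dst)] = cosine(features[src], features[dst])
    return dict(graphs)


# Toy dialogue: three utterances with 2-d fused features (made-up numbers).
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [
    (0, 1, "dialogue"),  # hypothetical speaker-dependency edge
    (1, 2, "dialogue"),
    (0, 2, "event"),     # hypothetical event-relation edge
]
graphs = build_relation_graphs(features, edges)
```

Keeping one edge set per relation means each relation can be processed by its own graph convolution before the per-relation outputs are combined, which is the role the abstract assigns to the Multiple Information Transformer.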

Paper Structure (32 sections, 27 equations, 7 figures, 5 tables)

Introduction
Motivation
Our Contributions
Related Work
Emotion Recognition in Conversation
Transformers for Dialogue Generation
Masked Self-Supervised Graph Learning
Balanced Optimization Based on Contrastive Learning
Preliminary Information
Problem Definition
Methodology
The Design of the DER-GCN Structure
Sequence Modeling and Cross-modal Feature Fusion
Weighted Multi-relational Affective Interaction Graph
Self-Supervised Masked Graph Autoencoder
...and 17 more sections

Figures (7)

Figure 1: An illustrative example of the impact of event relationships on the spatial distribution of emotion categories. (a) Raw dialogue text with four speakers. (b) A graph of dialogue relationships composed of emotional interactions between speakers. (c) The emotional interaction graph composed of dialogue and event relationships. (d) Spatial distribution of emotion categories in graphs composed of dialogue relationships. (e) Spatial distribution of emotion categories in graphs composed of dialogue and event relationships.
Figure 2: (a) The overall framework of DER-GCN. First, it preprocesses the multimodal data to obtain encoded feature embeddings via NN-1. Second, it uses NN-2 to perform cross-modal feature fusion. Third, it constructs a weighted multi-relational dialogue and event relation-aware graph from the fused feature vectors. Fourth, node and edge features are reconstructed via NN-3. Fifth, the fused multi-relational feature vectors are obtained through the Multiple Information Transformer, and a loss optimization strategy based on contrastive learning is used to address the data imbalance problem. Finally, it uses the emotion classifier to predict the final emotion label. (b) NN-1: multimodal feature encoder. (c) NN-2: cross-modal feature aggregator. (d) NN-3: self-supervised masked graph encoder.
Figure 3: (a) Heterogeneous dialogue graph composed of dialogue relations and event relations. (b) We split the heterogeneous graph to construct a weighted multi-relational dialogue graph.
Figure 4: The Multiple Information Transformer (MIT) consists of multiple Transformer modules, each containing multiple linear layers, 1D-Conv, and softmax layers. MIT captures the underlying joint distribution between different relations by transferring information.
Figure 5: Classification results of DER-GCN and LR-GCN on the IEMOCAP and MELD datasets.
...and 2 more figures
