AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Sheng Wu, Jiaxing Liu, Longbiao Wang, Dongxiao He, Xiaobao Wang, Jianwu Dang

TL;DR

The paper addresses emotion recognition in conversations (ERC) from multimodal cues and argues that existing work has underexplored multimodal fusion. It introduces AIMDiT, a framework with two main components: a Modality Augmentation Network (MAN), which transforms 1D modality sequences into 2D representations and learns intra- and inter-modal features with Inception blocks, and a Modality Interaction Network (MIN), which fuses modalities through cross-modal and self-modal Transformer mechanisms. On the MELD dataset, the authors report improvements of 2.34% in Acc-7 and 2.87% in w-F1 over state-of-the-art models. The approach targets multimodal fusion in ERC, with potential applications in human-computer interaction, social analysis, and dialogue systems.

Abstract

Emotion Recognition in Conversations (ERC) is a popular task in natural language processing that aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, effective multimodal fusion methods remain underexplored. We propose a novel framework called AIMDiT to address the multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network, which performs rich representation learning through dimension transformation of different modalities and a parameter-efficient inception block. The Modality Interaction Network, in turn, performs interaction fusion of the extracted inter-modal and intra-modal features. Experiments with the AIMDiT framework on the public benchmark dataset MELD show improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively, over state-of-the-art (SOTA) models.
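
The two ideas named in the abstract, dimension transformation and cross-modal interaction, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not specify how the 1D-to-2D transformation is performed, so `fold_1d_to_2d` assumes a simple row-wise fold with zero padding, and `cross_modal_attention` is generic scaled dot-product attention without learned projections. All function names and the toy text/audio features are hypothetical.

```python
import math


def fold_1d_to_2d(seq, width):
    """Fold a 1D sequence into rows of `width` values (assumed transform).

    The tail is zero-padded so that every row has the same length.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padded = list(seq) + [0.0] * (-len(seq) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    Queries come from one modality; keys and values come from another.
    No learned projection matrices are used in this sketch.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy features: 2 text steps and 3 audio steps, 2 dims each.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

grid = fold_1d_to_2d([1, 2, 3, 4, 5], width=2)
fused = cross_modal_attention(text, audio, audio)
```

Here `grid` holds the folded 2D layout of a five-element sequence, and `fused` holds one audio-informed vector per text step. A self-modal variant would pass the same modality as queries, keys, and values.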

Paper Structure

This paper contains 11 sections, 10 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Framework illustration of AIMDiT-based emotion recognition in conversations, which consists of four key components: Modality Encoder, Modality Augmentation Network, Modality Interaction Network, and Emotion Classifier.