AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Sheng Wu; Jiaxing Liu; Longbiao Wang; Dongxiao He; Xiaobao Wang; Jianwu Dang

TL;DR

The paper addresses emotion recognition in conversations (ERC) from multimodal cues and argues that existing work underexplores multimodal fusion. It introduces AIMDiT, a framework with two components: a Modality Augmentation Network (MAN), which transforms 1D modality sequences into 2D representations and learns intra- and inter-modal features with Inception blocks, and a Modality Interaction Network (MIN), which fuses the modalities through cross-modal and self-modal Transformer mechanisms. On the MELD dataset, AIMDiT improves over the state of the art by 2.34% in Acc-7 and 2.87% in w-F1. The approach is relevant to human-computer interaction, social analysis, and dialogue systems.

Abstract

Emotion Recognition in Conversations (ERC) is a popular task in natural language processing that aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, effective multimodal fusion methods remain underexplored. We propose a novel framework called AIMDiT to address the multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network, which performs rich representation learning through dimension transformation of different modalities and a parameter-efficient Inception block. The Modality Interaction Network then performs interaction fusion of the extracted inter-modal and intra-modal features. Experiments with AIMDiT on the public benchmark dataset MELD show improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively, over state-of-the-art (SOTA) models.
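To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the dimension transformation can be illustrated by folding a 1D feature vector into a 2D grid and stacking modalities as channels, and that the interaction step can be illustrated by plain scaled dot-product attention with queries from one modality and keys and values from another. The function names, toy feature values, and grid shape are all hypothetical; the Inception blocks and any learned projections are omitted.

```python
import math


def to_2d(seq, height, width):
    """Fold a 1D feature vector into a height x width grid (assumed transformation)."""
    if len(seq) != height * width:
        raise ValueError("sequence length must equal height * width")
    return [seq[r * width:(r + 1) * width] for r in range(height)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention with no learned weights.

    Queries come from one modality; keys and values come from another.
    Each output row is a convex combination of the value rows.
    """
    dim = len(keys[0])
    fused_rows = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused_rows.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return fused_rows


# Hypothetical utterance-level features for two modalities.
text = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
audio = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

# 1D -> 2D, then stack modalities as channels: shape (2 channels, 2 rows, 3 columns).
text_2d = to_2d(text, 2, 3)
audio_2d = to_2d(audio, 2, 3)
stacked = [text_2d, audio_2d]

# Text rows attend over audio rows.
fused = cross_modal_attention(text_2d, audio_2d, audio_2d)
```

In this sketch, `stacked` is the kind of multi-channel 2D input that a convolutional block could consume, and `fused` has one row per text row, each a weighted mix of audio rows. A self-modal variant would pass the same modality as queries, keys, and values.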
