MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction

Jiajun He; Xiaohan Shi; Xingfeng Li; Tomoki Toda

TL;DR

The paper addresses speech emotion recognition (SER) when the text modality contains automatic speech recognition (ASR) errors. It proposes MF-AED-AEC, a multi-task framework that combines two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), which improve the semantic coherence of ASR text, with a multimodal fusion (MF) module that aligns modality-specific and modality-invariant representations. The model is trained with three cross-entropy losses, and the AED and AEC components are excluded at inference. On IEMOCAP, the method surpasses transcript-based baselines, with a reported $4.1\%$ gain in unweighted average recall (UAR). The approach combines error-sensitive text refinement with cross-modal learning, which makes emotion inference from noisy ASR output more reliable. The results suggest practical value for real-world applications where accurate transcripts are unavailable or imperfect. Future work may extend the framework with visual modalities and contrastive objectives.

Abstract

The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker's emotion, with the text generally obtained through automatic speech recognition (ASR). An essential issue with this approach is that ASR errors in the text modality can degrade SER performance. Previous studies have proposed using an auxiliary ASR error detection task to adaptively assign weights to each word in the ASR hypotheses. However, this approach has limited potential for improvement because it does not address the coherence of semantic information in the text. Additionally, the inherent heterogeneity of different modalities leads to distribution gaps between their representations, making their fusion challenging. Therefore, in this paper, we incorporate two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), to enhance the semantic coherence of ASR text, and further introduce a novel multimodal fusion (MF) method to learn shared representations across modalities. We refer to our method as MF-AED-AEC. Experimental results indicate that MF-AED-AEC significantly outperforms the baseline model by a margin of 4.1\%.
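
The TL;DR states that the model is trained with three cross-entropy losses and that AED and AEC are excluded at inference. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. It assumes that AED is a per-token binary decision, that AEC predicts a token from a vocabulary, and that the three losses are combined as a weighted sum. The function names and the weights `w_aed` and `w_aec` are hypothetical and do not come from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def mean_cross_entropy(batch_logits, targets):
    """Average cross-entropy over a batch of (logits, target) pairs."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


def joint_loss(emo, aed, aec, w_aed=1.0, w_aec=1.0):
    """Weighted sum of the emotion, AED, and AEC cross-entropy losses.

    Each argument is a (batch_logits, targets) pair.
    w_aed and w_aec are hypothetical weights, not values from the paper.
    """
    return (mean_cross_entropy(*emo)
            + w_aed * mean_cross_entropy(*aed)
            + w_aec * mean_cross_entropy(*aec))


def predict_emotion(emotion_logits):
    """Inference reads only the emotion head; AED/AEC outputs are not used."""
    return max(range(len(emotion_logits)), key=emotion_logits.__getitem__)


if __name__ == "__main__":
    emo = ([[0.2, 1.5, -0.3, 0.0]], [1])       # 4 emotion classes (assumed)
    aed = ([[0.9, -0.4], [0.1, 0.7]], [0, 1])  # per-token: correct vs. error
    aec = ([[0.0, 2.0, 0.5]], [1])             # toy 3-token vocabulary
    print("joint loss:", joint_loss(emo, aed, aec))
    print("predicted class:", predict_emotion(emo[0][0]))
```

Because the auxiliary terms enter only through the training loss, dropping them at inference changes nothing in the emotion prediction path, which is consistent with the description above.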

Paper Structure (14 sections, 17 equations, 1 figure, 2 tables)

Introduction
Proposed Method
Problem Formulation
Embedding Module
ASR Error Detection (AED) Module
ASR Error Correction (AEC) Module
Multimodal Fusion (MF) Module
Emotion Classification Module
Joint Training
Experiments and Results
Experiment Settings
Dataset
Results and Analysis
Conclusion

Figures (1)

Figure 1: Overall architecture of the proposed MF-AED-AEC model.
