
Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

Yuanchao Li, Peter Bell, Catherine Lai

TL;DR

The paper addresses the gap between laboratory SER results and real-world deployment by evaluating SER with ASR transcripts across eleven ASR models on three diverse corpora, examining how Word Error Rate (WER) and fusion strategies affect performance. It systematically compares text-only and bimodal SER with six fusion techniques, revealing that moderate WER can be surprisingly tolerable and that fusion can mitigate degradation or even outperform ground-truth text in some cases. Building on these insights, the authors propose a two-stage ASR error-robust framework that combines ASR error correction (via N-best hypothesis aggregation and LLM/S2S refinement) with modality-gated fusion, achieving lower WER and higher SER performance than the best ASR transcript in many scenarios. This work advances practically deployable SER systems by quantifying ASR impacts, identifying robust fusion patterns, and offering a concrete mechanism to handle transcription errors in real-world multimodal emotion recognition tasks.

Abstract

Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes both text-only and bimodal SER with six fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. These findings provide insights into SER with ASR assistance, especially for real-world applications.
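
Since WER is the quantity the whole benchmark is organized around, a small worked example may help. The sketch below computes WER in its standard form: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the summary does not state which text normalization or scoring tool the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, with no
    casing or punctuation normalization (an assumption, not the paper's setup).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("so" -> "not") and one deletion ("today"): 2 / 5 words.
    print(word_error_rate("i am so happy today", "i am not happy"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.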

Paper Structure

This paper contains 12 sections, 1 equation, 1 figure, 3 tables, 1 algorithm.

Figures (1)

  • Figure 1: Proposed ASR error-robust framework. Blue: ASR error correction; frozen. Pink: Modality-gated fusion; trainable.
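
To make the term "modality-gated fusion" from the caption more concrete, here is a minimal sketch of one generic gating scheme: a sigmoid gate computed from the concatenated audio and text features decides, per dimension, how much of each modality to keep. This is an assumption for illustration and not the architecture in Figure 1; the summary does not specify the gate's form. The names `gated_fusion`, `weights`, and `bias` are hypothetical, and in a real system the gate parameters would be learned rather than set by hand.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio, text, weights, bias):
    """Fuse two equal-length feature vectors with a per-dimension gate.

    audio, text: lists of d floats (hypothetical modality embeddings).
    weights:     d rows of 2*d floats applied to the concatenation [audio; text].
    bias:        list of d floats.
    Returns g * audio + (1 - g) * text, where g = sigmoid(W [audio; text] + b).
    """
    if len(audio) != len(text):
        raise ValueError("audio and text features must have the same length")
    joint = list(audio) + list(text)
    fused = []
    for k in range(len(audio)):
        z = sum(w * x for w, x in zip(weights[k], joint)) + bias[k]
        g = sigmoid(z)  # g near 1 trusts audio, g near 0 trusts text
        fused.append(g * audio[k] + (1.0 - g) * text[k])
    return fused


if __name__ == "__main__":
    audio = [1.0, 3.0]
    text = [3.0, 1.0]
    zero_w = [[0.0] * 4 for _ in range(2)]
    # With zero weights and bias the gate is 0.5, so the output is the mean.
    print(gated_fusion(audio, text, zero_w, [0.0, 0.0]))
```

The intuition for pairing such a gate with ASR transcripts is that a gate value pushed toward the audio side lets the model lean less on the text features when the transcript is unreliable.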