Metadata-Enhanced Speech Emotion Recognition: Augmented Residual Integration and Co-Attention in Two-Stage Fine-Tuning

Zixiang Wan, Ziyue Qiu, Yiyang Liu, Wei-Qiang Zhang

TL;DR

This work improves speech emotion recognition (SER) with a metadata-enhanced, two-stage fine-tuning framework for Transformer-based self-supervised learning (SSL) models. It introduces Augmented Residual Integration (ARI) to preserve multi-level features across Transformer layers, and a Co-attention mechanism to exploit relationships among auxiliary tasks so that metadata information can be fused effectively. Evaluated on IEMOCAP with three SSL encoders, the approach achieves state-of-the-art unweighted and weighted accuracy (UA/WA) under speaker-independent conditions, with notable gains for ASR-related auxiliary tasks and consistent improvements across encoders. The findings indicate that metadata-aware multi-task learning, coupled with ARI and Co-attention, makes emotion recognition more robust, and they suggest the approach may carry over to other speech tasks that use self-supervised representations.

Abstract

Speech Emotion Recognition (SER) involves analyzing vocal expressions to determine the emotional state of speakers, where thorough utilization of audio information is paramount. We therefore propose a novel approach built on self-supervised learning (SSL) models that employs all available auxiliary information -- specifically metadata -- to enhance performance. Within a two-stage fine-tuning method based on multi-task learning, we introduce the Augmented Residual Integration (ARI) module, which enhances the Transformer layers in the encoder of SSL models. The module efficiently preserves acoustic features across all levels, thereby significantly improving the performance of metadata-related auxiliary tasks that require features from various levels. Moreover, the Co-attention module is incorporated because it complements ARI, enabling the model to effectively utilize multidimensional information and contextual relationships from the metadata-related auxiliary tasks. With pre-trained base models and a speaker-independent setup, our approach consistently surpasses state-of-the-art (SOTA) models on multiple SSL encoders for the IEMOCAP dataset.

Paper Structure

This paper contains 17 sections, 3 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: The proposed model is trained in two stages: the first stage trains the auxiliary tasks, and the second stage trains SER using the auxiliary-task information. Note: all Transformer layers are fine-tuned in stage 1, but the first four layers are frozen in stage 2.
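
The freezing schedule described in the Figure 1 note can be sketched as a small helper. This is a minimal illustration, not the authors' code: the function name `trainable_layer_mask`, its parameters, and the 12-layer default (typical of base-size SSL encoders) are assumptions introduced here.

```python
def trainable_layer_mask(stage, num_layers=12, frozen_in_stage2=4):
    """Return one bool per Transformer layer; True means the layer is fine-tuned.

    Hypothetical helper: the name, num_layers=12 and the argument layout are
    assumptions. Only the rule itself (all layers trained in stage 1, first
    four layers frozen in stage 2) comes from the Figure 1 note.
    """
    if stage == 1:
        # Stage 1 (auxiliary tasks): every Transformer layer is fine-tuned.
        return [True] * num_layers
    if stage == 2:
        # Stage 2 (SER): the lowest `frozen_in_stage2` layers stay fixed.
        return [index >= frozen_in_stage2 for index in range(num_layers)]
    raise ValueError(f"stage must be 1 or 2, got {stage}")


if __name__ == "__main__":
    for stage in (1, 2):
        mask = trainable_layer_mask(stage)
        print(f"stage {stage}: {sum(mask)} of {len(mask)} layers trainable")
```

In a PyTorch training script, a mask like this would typically be applied by setting `requires_grad` to the corresponding value on each layer's parameters before building the optimizer.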