ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets

Shahin Amiriparian, Filip Packań, Maurice Gerczuk, Björn W. Schuller

TL;DR

This work tackles cross-lingual and cross-domain speech emotion recognition (SER) by pairing a large, diverse training corpus with a tailored model extension. EmoSet++ aggregates 37 datasets into 150,907 samples (119.5 hours). ExHuBERT extends HuBERT by duplicating each encoder layer together with its weights, freezing the first copy, and adding a zero-initialized linear layer and skip connections, so that the extended model preserves the original functionality while remaining adaptable during fine-tuning. After fine-tuning on EmoSet++, ExHuBERT is evaluated on unseen SER datasets, where the authors report a new benchmark for various SER tasks. The model and details on EmoSet++ are available at https://huggingface.co/amiriparian/ExHuBERT.

Abstract

Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we propose a novel twofold approach. First, we gather EmoSet++, a comprehensive multi-lingual, multi-cultural speech emotion corpus with 37 datasets, 150,907 samples, and a total duration of 119.5 hours. Second, we introduce ExHuBERT, an enhanced version of HuBERT achieved by backbone extension and fine-tuning on EmoSet++. We duplicate each encoder layer and its weights, then freeze the first duplicate, integrating an extra zero-initialized linear layer and skip connections to preserve functionality and ensure its adaptability for subsequent fine-tuning. Our evaluation on unseen datasets shows the efficacy of ExHuBERT, setting a new benchmark for various SER tasks. Model and details on EmoSet++: https://huggingface.co/amiriparian/ExHuBERT.
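The block extension described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: `ToyLayer`, `ZeroLinear`, `ExtendedBlock`, `extend`, and `run` are hypothetical names, the toy layer is a plain affine map with tanh rather than a HuBERT Transformer layer, and the exact wiring (frozen copy first, then trainable copy, then a zero-initialized linear layer whose output is added back through a skip connection) is an assumption based on the abstract's wording. The point it shows is why zero initialization preserves functionality: at the start of fine-tuning the added branch contributes exactly zero, so the extended stack reproduces the original outputs.

```python
import copy
import math
import random


class ToyLayer:
    """Hypothetical stand-in for one encoder layer: affine map followed by tanh."""

    def __init__(self, dim, rng):
        self.weight = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        self.bias = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        self.trainable = True

    def __call__(self, x):
        return [
            math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(self.weight, self.bias)
        ]


class ZeroLinear:
    """Linear layer whose weights and bias all start at zero."""

    def __init__(self, dim):
        self.weight = [[0.0] * dim for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, x):
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


class ExtendedBlock:
    """One original layer expanded into a frozen copy plus a trainable copy.

    Assumed wiring: out = frozen(x) + zero_linear(trainable(frozen(x))).
    """

    def __init__(self, layer):
        self.frozen = copy.deepcopy(layer)  # first duplicate, kept fixed
        self.frozen.trainable = False
        self.tuned = copy.deepcopy(layer)   # second duplicate, fine-tuned
        self.gate = ZeroLinear(len(layer.bias))

    def __call__(self, x):
        h = self.frozen(x)
        delta = self.gate(self.tuned(h))    # all zeros before any training
        return [a + d for a, d in zip(h, delta)]  # skip connection


def extend(layers):
    """Wrap every layer of a stack in an ExtendedBlock."""
    return [ExtendedBlock(layer) for layer in layers]


def run(layers, x):
    """Apply a stack of layers in sequence."""
    for layer in layers:
        x = layer(x)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    base = [ToyLayer(4, rng) for _ in range(3)]
    extended = extend(base)
    features = [0.1, -0.2, 0.3, 0.4]
    # Before fine-tuning, the extended stack matches the original stack.
    print(run(extended, features) == run(base, features))
```

In a real setting the same idea would be applied to the encoder layers of a pre-trained HuBERT checkpoint, with only the second copy and the added linear layer receiving gradient updates; those details are left out here because the abstract does not specify them.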

Paper Structure (1 section, 1 table)

This paper contains 1 section, 1 table.
