Enhancing Speech Emotion Recognition via Fine-Tuning Pre-Trained Models and Hyper-Parameter Optimisation

Aryan Golbaghi, Shuo Zhou

TL;DR

The paper addresses efficient speech emotion recognition (SER) on commodity hardware by combining a pre-trained wav2vec 2.0 encoder (fine-tuned on IEMOCAP) with automated hyperparameter optimisation (HPO). It evaluates Gaussian Process Bayesian Optimisation (GP-BO) and Tree-structured Parzen Estimators (TPE) within a CPU-only AutoML workflow. On EmoDB, GP-BO reaches 0.96 balanced class accuracy (BCA) in 11 minutes and TPE reaches 0.97 in 15 minutes, both surpassing the best AutoSpeech 2020 GPU baseline (0.85 in 30 minutes). GP-BO offers the best speed–accuracy trade-off, while TPE yields the highest BCA within the same 15-trial budget. The EmoDB-trained, HPO-tuned model also improves zero-shot accuracy by 0.25 on CREMA-D and 0.26 on RAVDESS, indicating promising cross-lingual transfer. Future directions include multilingual corpora, parameter-efficient tuning, and deployment-aware energy/latency considerations.

Abstract

We propose a workflow for speech emotion recognition (SER) that combines pre-trained representations with automated hyperparameter optimisation (HPO). Using the SpeechBrain wav2vec2-base model fine-tuned on IEMOCAP as the encoder, we compare two HPO strategies, Gaussian Process Bayesian Optimisation (GP-BO) and Tree-structured Parzen Estimators (TPE), under an identical four-dimensional search space and 15-trial budget, with balanced class accuracy (BCA) on the German EmoDB corpus as the objective. All experiments run on 8 CPU cores with 32 GB RAM. GP-BO achieves 0.96 BCA in 11 minutes, and TPE (Hyperopt implementation) attains 0.97 in 15 minutes. In contrast, grid search requires 143 trials and 1,680 minutes to exceed 0.9 BCA, and the best AutoSpeech 2020 baseline reports only 0.85 in 30 minutes on GPU. For cross-lingual generalisation, an EmoDB-trained HPO-tuned model improves zero-shot accuracy by 0.25 on CREMA-D and 0.26 on RAVDESS. Results show that efficient HPO with pre-trained encoders delivers competitive SER on commodity CPUs. Source code for this work is available at: https://github.com/youngaryan/speechbrain-emotion-hpo.
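The objective named in the abstract, balanced class accuracy, can be illustrated with a short sketch. It assumes BCA is the usual mean of per-class recalls. The paper's exact definition is not shown in this excerpt, and the function name and the emotion labels below are hypothetical.

```python
from collections import defaultdict


def balanced_class_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed over the classes present in y_true.

    Assumption: this is the standard "balanced accuracy" definition; the
    paper's own formula is not reproduced in this excerpt.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # number of correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Average the per-class recalls so every class has equal weight.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical imbalanced example: always predicting the majority class.
y_true = ["anger"] * 4 + ["sadness"]
y_pred = ["anger"] * 5
print(balanced_class_accuracy(y_true, y_pred))  # 0.5 (plain accuracy would be 0.8)
```

Because each class contributes equally, a model that ignores a minority emotion is penalised even when its plain accuracy looks high, which is one reason such a metric suits corpora with uneven class sizes.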

Paper Structure

This paper contains 13 sections, 7 equations, 1 figure, 6 tables, 1 algorithm.

Figures (1)

  • Figure 1: Proposed AutoML fine-tuning workflow for SER. A pre-trained model is fine-tuned on $\mathcal{D}_{\text{train}}$, with HPO guiding hyperparameter search and $M_{\text{val}}$ used for model selection.
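The model-selection loop described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses plain random search as a stand-in sampler instead of GP-BO or TPE, the four search dimensions and their ranges are hypothetical (the excerpt does not list the real ones), and `toy_validation_score` is a synthetic placeholder for fine-tuning on $\mathcal{D}_{\text{train}}$ and measuring $M_{\text{val}}$.

```python
import math
import random

# Hypothetical four-dimensional search space (not taken from the paper).
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-3),  # continuous, sampled log-uniformly
    "batch_size": [8, 16, 32],      # categorical
    "dropout": (0.0, 0.5),          # continuous, sampled uniformly
    "epochs": [2, 4, 6],            # categorical
}


def sample_config(rng):
    """Draw one configuration from SEARCH_SPACE."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": math.exp(rng.uniform(math.log(lo), math.log(hi))),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "dropout": rng.uniform(*SEARCH_SPACE["dropout"]),
        "epochs": rng.choice(SEARCH_SPACE["epochs"]),
    }


def toy_validation_score(config):
    """Synthetic stand-in for "fine-tune on D_train, return M_val".

    A real objective would train the encoder and compute BCA on held-out data.
    """
    lr_penalty = 0.1 * abs(math.log10(config["learning_rate"]) + 4)
    dropout_penalty = 0.2 * abs(config["dropout"] - 0.2)
    return max(0.0, 1.0 - lr_penalty - dropout_penalty)


def run_hpo(objective, n_trials=15, seed=0):
    """Evaluate n_trials sampled configurations and keep the best by validation score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    history = []
    for _ in range(n_trials):
        config = sample_config(rng)
        score = objective(config)
        history.append((config, score))
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, history


best_config, best_score, history = run_hpo(toy_validation_score)
print(f"best validation score after {len(history)} trials: {best_score:.3f}")
print(best_config)
```

A model-based optimiser such as GP-BO or TPE would replace `sample_config` with a sampler that conditions on `history`, which is what allows a small budget like 15 trials to be spent on promising regions of the space.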