Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers

Aneesha Sampath, James Tavernor, Emily Mower Provost

TL;DR

The paper tackles the high computational cost of finetuning large pretrained transformers for dimensional speech emotion recognition. It systematically compares full finetuning, partial finetuning of transformer layers, mixed precision training, caching of intermediate representations, and LoRA on the Wav2Vec 2.0 base model. Finetuning the final three transformer layers in mixed precision performs comparably to full finetuning while training $67\%$ faster. Adding caching of intermediate representations raises the speedup to $88\%$ and reduces trainable parameters by up to $71\%$, with minimal performance trade-offs, which enables training on lower-memory GPUs. These results give practical guidelines that make accurate dimensional speech emotion recognition accessible to more researchers and practitioners. Validating the findings on alternative models such as WavLM and HuBERT remains a possible extension.

Abstract

Accurate speech emotion recognition is essential for developing human-facing systems. Recent advancements have included finetuning large, pretrained transformer models like Wav2Vec 2.0. However, the finetuning process requires substantial computational resources, including high-memory GPUs and significant processing time. As the demand for accurate emotion recognition continues to grow, efficient finetuning approaches are needed to reduce the computational burden. Our study focuses on dimensional emotion recognition, predicting attributes such as activation (calm to excited) and valence (negative to positive). We present various finetuning techniques, including full finetuning, partial finetuning of transformer layers, finetuning with mixed precision, partial finetuning with caching, and low-rank adaptation (LoRA) on the Wav2Vec 2.0 base model. We find that partial finetuning with mixed precision achieves performance comparable to full finetuning while increasing training speed by 67%. Caching intermediate representations further boosts efficiency, yielding an 88% speedup and a 71% reduction in learnable parameters. We recommend finetuning the final three transformer layers in mixed precision to balance performance and training efficiency, and adding intermediate representation caching for optimal speed with minimal performance trade-offs. These findings lower the barriers to finetuning speech emotion recognition systems, making accurate emotion recognition more accessible to a broader range of researchers and practitioners.

Paper Structure

This paper contains 20 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Non-caching vs caching approach for partial finetuning (three-layer). The blue layers with the snowflake are frozen layers, whereas the orange layers with the fire are trainable layers. 'FC' represents fully connected layers.
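
The difference between the two approaches in Figure 1 can be illustrated with a toy sketch. This is not the authors' implementation: the layers are plain Python functions rather than transformer blocks, no weights are updated, and every name here (`make_layer`, `simulate`, the layer and utterance counts) is hypothetical. It assumes the frozen layers are deterministic (no dropout or input augmentation), so their output for a given utterance is the same in every epoch and can be stored after the first pass.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(counter: Dict[str, int], kind: str) -> Layer:
    """Toy stand-in for one transformer layer; it counts how often it runs."""
    def layer(x: Vector) -> Vector:
        counter[kind] += 1
        return [0.5 * v + 1.0 for v in x]
    return layer


def run(layers: List[Layer], x: Vector) -> Vector:
    """Apply a stack of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def simulate(use_cache: bool, n_layers: int = 12, n_trainable: int = 3,
             n_utterances: int = 4, epochs: int = 5
             ) -> Tuple[Dict[str, int], Dict[int, Vector]]:
    """Count layer executions with and without caching the frozen output."""
    counter = {"frozen": 0, "trainable": 0}
    frozen = [make_layer(counter, "frozen")
              for _ in range(n_layers - n_trainable)]
    trainable = [make_layer(counter, "trainable")
                 for _ in range(n_trainable)]
    # Hypothetical dataset: utterance id -> toy feature vector.
    dataset = {i: [float(i), float(i) + 1.0] for i in range(n_utterances)}

    cache: Dict[int, Vector] = {}
    outputs: Dict[int, Vector] = {}
    for _ in range(epochs):
        for utt_id, features in dataset.items():
            if use_cache:
                # Frozen layers run once per utterance, then are reused.
                if utt_id not in cache:
                    cache[utt_id] = run(frozen, features)
                hidden = cache[utt_id]
            else:
                # Frozen layers are recomputed in every epoch.
                hidden = run(frozen, features)
            # Only this part would receive gradient updates in real training.
            outputs[utt_id] = run(trainable, hidden)
    return counter, outputs


calls_plain, out_plain = simulate(use_cache=False)
calls_cached, out_cached = simulate(use_cache=True)
print("without cache:", calls_plain)
print("with cache:   ", calls_cached)
```

In this toy setting both runs produce identical outputs, but with caching the frozen stack executes once per utterance instead of once per utterance per epoch. The trainable layers do the same amount of work either way, so the saving grows with the number of epochs and with the share of layers that are frozen.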