PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models

Tiantian Feng, Shrikanth Narayanan

TL;DR

The paper addresses the practical challenge of fine-tuning large pre-trained speech models for speech emotion recognition (SER) by investigating parameter-efficient transfer learning (PEFT) methods. It evaluates three PEFT approaches—adapter tuning, embedding prompt tuning, and LoRa—across four popular SER datasets (IEMOCAP, CREMA-D, MSP-Improv, MSP-Podcast) using multiple backbones (Whisper, Wav2vec 2.0, WavLM) with frozen encoders and a consistent downstream classifier. Results show that LoRa generally delivers the strongest SER performance with the fewest extra parameters, particularly on the WavLM Base+ backbone, and it often improves fairness metrics compared to fully fine-tuned baselines. The work provides practical insights for deploying efficient and fair SER systems and releases code and models to support further research in PEFT for speech tasks, with future directions including multimodal SER and edge-deployment considerations.

Abstract

Many recent studies have focused on fine-tuning pre-trained models for speech emotion recognition (SER), resulting in promising performance compared to traditional methods that rely largely on low-level, knowledge-inspired acoustic features. These pre-trained speech models learn general-purpose speech representations using self-supervised or weakly-supervised learning objectives from large-scale datasets. Despite the significant advances made in SER through the use of pre-trained architectures, fine-tuning these large models for different datasets requires saving a full copy of the weight parameters for each dataset, which makes them impractical to deploy in real-world settings. As an alternative, this work explores parameter-efficient fine-tuning (PEFT) approaches for adapting pre-trained speech models for emotion recognition. Specifically, we evaluate the efficacy of adapter tuning, embedding prompt tuning, and LoRa (low-rank adaptation) on four popular SER testbeds. Our results reveal that LoRa achieves the best fine-tuning performance in emotion recognition while enhancing fairness and requiring only a minimal number of additional weight parameters. Furthermore, our findings offer novel insights into future research directions in SER, distinct from existing approaches that focus on directly fine-tuning the model architecture. Our code is publicly available at https://github.com/usc-sail/peft-ser.
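To illustrate why a low-rank method such as LoRa needs so few extra weights, the sketch below adds a trainable rank-r update to a frozen weight matrix and counts the added parameters. It is a minimal illustration and not the authors' implementation: the layer width (768), the rank (8), and the helper names `matmul`, `lora_forward`, and `lora_param_count` are all assumptions chosen for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Compute x @ W + scale * (x @ A) @ B without forming the merged matrix.

    W stays frozen; only A (d_in x r) and B (r x d_out) would be trained.
    """
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Number of extra trainable weights added by one low-rank update."""
    return rank * (d_in + d_out)

# Hypothetical sizes for a single projection layer (assumed, not from the paper).
d_in, d_out, rank = 768, 768, 8
full_params = d_in * d_out
extra_params = lora_param_count(d_in, d_out, rank)

# Tiny numeric check: with B initialised to zeros, the adapted layer
# reproduces the frozen layer exactly at the start of training.
random.seed(0)
x = [[random.random() for _ in range(4)]]
w = [[random.random() for _ in range(3)] for _ in range(4)]
a = [[random.random() for _ in range(2)] for _ in range(4)]
b_zero = [[0.0] * 3 for _ in range(2)]
y = lora_forward(x, w, a, b_zero)

print(full_params, extra_params)
```

Under these assumed sizes the low-rank update stores 12,288 weights per layer against 589,824 for a full copy of the same matrix, which is the storage argument the abstract makes for PEFT when one model has to serve several datasets.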

Paper Structure (17 sections, 3 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: System architecture of different parameter-efficient fine-tuning (PEFT) approaches used in this study.
  • Figure 2: Modeling framework used in this work. The pre-trained models shown in the diagram include Wav2vec 2.0 Base, WavLM Base+, and Whisper models.
  • Figure 3: Performance with fine-tuning downstream classification model (pre-trained model frozen during training) for SER.
  • Figure 4: SER performance varying embedding prompt sizes.
  • Figure 5: SER performance with different bottleneck size.
  • ...and 2 more figures