Pre-Finetuning for Few-Shot Emotional Speech Recognition

Maximillian Chen; Zhou Yu

TL;DR

This work views speaker adaptation as a few-shot learning problem and proposes pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives, a transfer learning approach inspired by the recent success of pre-trained models in natural language tasks.

Abstract

Speech models have long been known to overfit individual speakers for many classification tasks. This leads to poor generalization in settings where the speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and propose investigating transfer learning approaches inspired by recent success with pre-trained models in natural language tasks. We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate our pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
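The set of pre-finetuning configurations described above can be sketched as follows. This is only an illustration under stated assumptions: it reads "every permutation" as every non-empty, unordered combination of the four corpora (if ordering matters, the count would differ), and the corpus names are placeholders because the abstract does not name them.

```python
from itertools import combinations

# Placeholder names (assumption): the four corpora are not named in this section.
CORPORA = ["corpus_a", "corpus_b", "corpus_c", "corpus_d"]


def pre_finetuning_configurations(corpora):
    """Return every non-empty, unordered subset of the given corpora."""
    configs = []
    for size in range(1, len(corpora) + 1):
        configs.extend(combinations(corpora, size))
    return configs


configs = pre_finetuning_configurations(CORPORA)
for config in configs:
    # Each tuple is one candidate set of corpora to pre-finetune on.
    print(len(config), config)
```

Under this reading, each tuple corresponds to one pre-finetuned model, which would then be compared against a baseline with no pre-finetuning.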

Paper Structure (14 sections, 4 figures, 1 table)

Introduction
Related Work
Methodology
Corpora Selection
Pre-Finetuning in Speech
Downstream Finetuning
Experimental Results
Effect of Number of Pre-Finetuning Corpora
Ablation on Individual Corpus Contributions
Ablation on Pre-Finetuning Corpus Inclusion
Scaling Downstream Training Data Sizes
Discussion
Conclusion
Acknowledgements

Figures (4)

Figure 1: Workflow of pre-finetuning an emotion recognition model. Wav2Vec2.0 is initialized with a separate linear classification head for each pre-finetuning dataset to ensure the correct output space. Pre-finetuning tasks are randomly sampled throughout training, and each instance is mapped to the corresponding classification head. Each task's loss is computed separately and averaged during validation.
Figure 2: Comparison of downstream task performance of models pre-finetuned on varying numbers of corpora. Each line depicts the change in the mean and standard error of Macro F1.
Figure 3: Average difference in Macro F1 resulting from pre-finetuning on each corpus compared to the Wav2Vec2.0 baseline. Differences shown are aggregations controlling for the number of few-shot examples, each speaker, and each emotion.
Figure 4: Effect of the number of training examples used during fine-tuning for the baseline model with no pre-finetuning (No PFT) and the model pre-finetuned on all four corpora (All PFT). Results are stratified by emotion. Left: classification results on native English speech. Right: classification results on native Mandarin speech.
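The routing described in the Figure 1 caption (one linear head per pre-finetuning dataset, randomly sampled tasks, per-task losses averaged at validation) can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the task names, label counts, and feature size are hypothetical, the shared encoder is a stand-in for Wav2Vec2.0, and averaging the per-task mean losses is one plausible reading of "averaged during validation".

```python
import math
import random

# Hypothetical task names and label counts (assumption): not given in this section.
TASK_NUM_CLASSES = {"corpus_a": 4, "corpus_b": 5, "corpus_c": 7}
FEATURE_DIM = 8  # stand-in for the size of the shared encoder's pooled output

rng = random.Random(0)


def make_head(num_classes):
    """One linear head: a (num_classes x FEATURE_DIM) weight matrix plus biases."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(FEATURE_DIM)]
               for _ in range(num_classes)]
    return weights, [0.0] * num_classes


# A separate head per task, so each task gets its own output space.
heads = {task: make_head(n) for task, n in TASK_NUM_CLASSES.items()}


def shared_encoder(features):
    """Placeholder for Wav2Vec2.0: inputs here are already feature vectors."""
    return features


def head_logits(task, features):
    """Route the encoded instance to the head that belongs to its task."""
    weights, biases = heads[task]
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single instance."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[label]


def sample_instance():
    """Pick a task at random, then draw a synthetic instance for that task."""
    task = rng.choice(list(TASK_NUM_CLASSES))
    features = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
    return task, features, rng.randrange(TASK_NUM_CLASSES[task])


def validation_loss(instances):
    """Compute each task's mean loss separately, then average across tasks."""
    per_task = {}
    for task, features, label in instances:
        logits = head_logits(task, shared_encoder(features))
        per_task.setdefault(task, []).append(cross_entropy(logits, label))
    task_means = {t: sum(v) / len(v) for t, v in per_task.items()}
    return task_means, sum(task_means.values()) / len(task_means)


batch = [sample_instance() for _ in range(60)]
task_means, avg_loss = validation_loss(batch)
print(task_means, avg_loss)
```

Keeping the heads in a dictionary keyed by task makes the mapping from instance to output space explicit. In a real setup the encoder would be shared and updated by every task, while each head would only receive gradients from its own corpus.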
