Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

Ali Shendabadi, Parnia Izadirad, Mostafa Salehi, Mahmoud Bijankhan

TL;DR

The paper tackles speech emotion recognition (SER) under data scarcity by combining Whisper-derived representations with two attention-based pooling methods that compress high-dimensional features while preserving emotional cues. It shows that Multi-head QKV Pooling, particularly with Whisper Small, achieves state-of-the-art unweighted accuracy on ShEMO and competes closely on IEMOCAP, while being substantially lighter than larger models such as HuBERT X-Large. The study also reports a language-specific finding: intermediate Whisper encoder layers can be more informative for Persian SER. Overall, the approach offers a practical path for SER in low-resource languages and resource-constrained deployments, although the gains depend on the architecture and the dataset.

Abstract

Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.
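
The abstract reports its headline result in unweighted accuracy. As a minimal sketch, assuming the usual SER convention that unweighted accuracy is the mean of per-class recalls (the paper's exact evaluation protocol is not given here), the following Python shows how it differs from plain accuracy on an imbalanced label set. The function names and the toy labels are illustrative, not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Fraction of all utterances classified correctly (plain accuracy)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy, made-up labels: a classifier that always predicts the majority class.
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 10

wa = weighted_accuracy(y_true, y_pred)    # 8 of 10 utterances are correct
ua = unweighted_accuracy(y_true, y_pred)  # recall is 1.0 for neutral, 0.0 for anger
```

Under this definition a model that ignores the minority class is penalized heavily, which is why the metric is commonly preferred for imbalanced emotion corpora.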

Paper Structure (20 sections, 9 equations, 5 figures, 4 tables)

This paper contains 20 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Demonstration of the Multi-head Attentive Average Pooling and Multi-head QKV Pooling pipeline for SER. After extracting speech representations using Whisper encoders, a multi-head pooling method is applied to reduce the dimensionality of the representation matrix. The stacked rectangles represent attention heads. The outputs from all attention heads are concatenated and subsequently projected through a weight matrix ${W^o}$, producing a final 256-dimensional vector that serves as the input to the classifier.
  • Figure 2: Performance comparison of Whisper Tiny and Whisper Small encoders for Multi-head Attentive Average Pooling (AttW) and Multi-head QKV Pooling (QKV) methods on ShEMO. The radar chart shows that Whisper Small consistently outperforms Whisper Tiny, with particularly notable improvements in categories with fewer samples.
  • Figure 3: Confusion matrices of classifying ShEMO using different sizes of Whisper for representation extraction.
  • Figure 4: Comparison of learning speed on ShEMO when different layers of the Whisper encoder are used as the representation.
  • Figure 5: Performance comparison of Mean Average Pooling (Mean), Multi-head Attentive Average Pooling (Attentive) and Multi-head QKV Pooling (QKV) when using four different layers of the Whisper Small encoder.
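
The pipeline described in the Figure 1 caption (per-head pooling over time, concatenation of the heads, projection through $W^o$ to a 256-dimensional vector) can be sketched roughly as follows. This is a hedged illustration in NumPy, not the authors' implementation: the exact scoring functions, the number of heads (8 here), the random weights, and the names `multihead_attentive_pool` and `multihead_qkv_pool` are all assumptions. Only the 256-dimensional output comes from the caption; 768 is the hidden size of Whisper Small.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def multihead_attentive_pool(X, w, W_o):
    """Attentive average pooling (assumed form).

    X: (T, D) frame-level features, w: (H, D // H) one scoring vector per head,
    W_o: (D, out_dim) output projection.
    """
    T, D = X.shape
    H = w.shape[0]
    heads = X.reshape(T, H, D // H)                 # split features into H heads
    scores = np.einsum("thd,hd->th", heads, w)      # one score per frame and head
    alpha = softmax(scores, axis=0)                 # attention weights over time
    pooled = np.einsum("th,thd->hd", alpha, heads)  # weighted average per head
    return pooled.reshape(D) @ W_o                  # concatenate heads, project


def multihead_qkv_pool(X, q, W_k, W_v, W_o):
    """QKV pooling (assumed form): a learned query per head attends over frames.

    q: (H, D // H) queries, W_k and W_v: (D, D) key and value projections.
    """
    T, D = X.shape
    H, d = q.shape
    K = (X @ W_k).reshape(T, H, d)
    V = (X @ W_v).reshape(T, H, d)
    scores = np.einsum("thd,hd->th", K, q) / np.sqrt(d)  # scaled dot product
    alpha = softmax(scores, axis=0)
    pooled = np.einsum("th,thd->hd", alpha, V)
    return pooled.reshape(D) @ W_o


# Illustrative shapes: 100 frames, hidden size 768, 8 heads (assumed), output 256.
rng = np.random.default_rng(0)
T, D, H, OUT = 100, 768, 8, 256
X = rng.normal(size=(T, D))
w = rng.normal(size=(H, D // H)) * 0.1
q = rng.normal(size=(H, D // H)) * 0.1
W_k = rng.normal(size=(D, D)) * 0.02
W_v = rng.normal(size=(D, D)) * 0.02
W_o = rng.normal(size=(D, OUT)) * 0.02

out_att = multihead_attentive_pool(X, w, W_o)  # vector passed to the classifier
out_qkv = multihead_qkv_pool(X, q, W_k, W_v, W_o)
```

In this sketch, setting every scoring vector to zero makes the attention weights uniform, so attentive pooling reduces to plain mean pooling over time. That gives one way to read the Mean baseline in Figure 5 as a special case.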