Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study

Lucky Onyekwelu-Udoka, Md Shafiqul Islam, Md Shahedul Hasan

TL;DR

This work addresses the need for accurate speech emotion recognition (SER) on resource-constrained devices by comparing a lightweight transformer (DistilHuBERT) and a spectrogram-based transformer (PaSST) against a CNN–LSTM baseline on the CREMA-D dataset, using a speaker-independent 70/15/15 split and data augmentation. DistilHuBERT achieves the highest overall accuracy (70.64%) and F1 score (70.36%) with a very small footprint (0.02 MB), outperforming both PaSST and the CNN–LSTM baseline (61.07%). An ablation study of three PaSST classification heads (Linear, MLP, Attentive Pooling) shows that the MLP head performs best among the PaSST configurations but remains below DistilHuBERT, and that the Linear head performs worst. The findings suggest that self-supervised waveform models are well suited to SER on edge devices, while spectrogram-based transformers need careful head design and pretraining to approach them. Future work points to multimodal cues and emotion-aware fine-tuning strategies.
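The speaker-independent 70/15/15 split mentioned above can be illustrated with a minimal sketch. It assumes CREMA-D clip names follow the `ActorID_Sentence_Emotion_Level.wav` pattern and that the ratios apply to speakers rather than to clips; the function name `speaker_independent_split` and the seed are hypothetical, and the authors' actual splitting procedure may differ.

```python
import random
from collections import defaultdict


def speaker_independent_split(filenames, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split clips into train/val/test so that no speaker appears in two subsets.

    Assumption: the speaker ID is the first underscore-separated field
    of each file name (e.g. "1001_DFA_ANG_XX.wav" -> "1001").
    """
    by_speaker = defaultdict(list)
    for name in filenames:
        speaker_id = name.split("_")[0]
        by_speaker[speaker_id].append(name)

    # Shuffle speakers (not clips) so every clip of a speaker stays together.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = round(ratios[0] * len(speakers))
    n_val = round(ratios[1] * len(speakers))
    groups = {
        "train": speakers[:n_train],
        "val": speakers[n_train:n_train + n_val],
        "test": speakers[n_train + n_val:],  # remainder goes to the test set
    }
    return {
        split: [clip for spk in ids for clip in by_speaker[spk]]
        for split, ids in groups.items()
    }
```

Splitting by speaker rather than by clip keeps a model from scoring well simply by recognising voices it has already heard during training.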

Abstract

Emotion recognition from speech plays a vital role in the development of empathetic human-computer interaction systems. This paper presents a comparative analysis of two lightweight transformer-based models, DistilHuBERT and PaSST, on the task of classifying six core emotions from the CREMA-D dataset. We benchmark their performance against a traditional CNN-LSTM baseline that uses MFCC features. DistilHuBERT achieves the highest accuracy (70.64%) and F1 score (70.36%) while maintaining an exceptionally small model size (0.02 MB), outperforming both PaSST and the baseline. We also conduct an ablation study on three PaSST variants, with Linear, MLP, and Attentive Pooling heads, to understand the effect of classification head architecture on model performance. Our results indicate that PaSST with an MLP head performs best among its variants but still falls short of DistilHuBERT. Among the emotion classes, angry is consistently the most accurately detected, while disgust remains the most challenging. These findings suggest that lightweight transformers like DistilHuBERT offer a compelling solution for real-time speech emotion recognition on edge devices. The code is available at: https://github.com/luckymaduabuchi/Emotion-detection-.
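The abstract names the Attentive Pooling head without describing it. As a rough illustration only, the sketch below contrasts plain mean pooling over frame embeddings with one common form of attentive pooling, in which a learned vector scores each frame and a softmax turns the scores into pooling weights. The functions `mean_pool` and `attentive_pool` and the scoring vector `w` are hypothetical; the head used in the paper may be defined differently.

```python
import math


def mean_pool(frames):
    """Average frame embeddings (a list of equal-length lists) over time."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def attentive_pool(frames, w):
    """Weight each frame by softmax(w . frame), then sum over time.

    `w` stands in for a learned scoring vector (an assumption here).
    Returns the pooled embedding and the per-frame attention weights.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in frames]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights
```

With an all-zero `w`, every frame gets the same weight and the result reduces to mean pooling; a trained `w` would instead let frames that carry more emotional cues dominate the pooled embedding before it is passed to a classifier.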

Paper Structure

This paper contains 1 section, 10 equations, 6 figures, 3 tables.

Table of Contents

  1. Introduction

Figures (6)

  • Figure 1: DistilHuBERT architecture overview.
  • Figure 2: PaSST
  • Figure 3: Overall training pipeline for DistilHuBERT and PaSST models on the CREMA-D dataset.
  • Figure 4: Confusion matrices of PaSST-MLP and PaSST-Attention showing per-emotion classification performance on the CREMA-D dataset.
  • Figure 5: Confusion matrices of DistilHuBERT and PaSST-Linear showing per-emotion classification performance on the CREMA-D dataset.
  • ...and 1 more figure