Deep Learning for Speech Emotion Recognition: A CNN Approach Utilizing Mel Spectrograms

Niketa Penumajji

TL;DR

This work addresses the need for practical speech emotion recognition in learning environments by transforming audio into mel-spectrogram images and applying a CNN to learn emotional patterns. It builds a complete pipeline: selecting the RAVDESS dataset, converting audio to mel-scale spectrograms via the STFT, and training a 4-layer CNN for 125 epochs, which achieves 68.88% accuracy on a blind test. A lightweight Tkinter GUI lets non-technical users run predictions in real time, illustrating feasibility for classroom use and educational applications. Although real-world classroom testing was limited by the pandemic, the results and the accompanying GUI demonstrate the approach's potential for inferring affective states and guiding learning experiences in open-ended environments.

Abstract

This paper explores the application of Convolutional Neural Networks (CNNs) for classifying emotions in speech through Mel Spectrogram representations of audio files. Traditional methods such as Gaussian Mixture Models and Hidden Markov Models have proven insufficient for practical deployment, prompting a shift towards deep learning techniques. By transforming audio data into a visual format, the CNN model autonomously learns to identify intricate patterns, enhancing classification accuracy. The developed model is integrated into a user-friendly graphical interface, facilitating real-time predictions and potential applications in educational environments. The study aims to advance the understanding of deep learning in speech emotion recognition, assess the model's feasibility, and contribute to the integration of technology in learning contexts.
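
To make the audio-to-image step concrete, the following is a minimal sketch of how a log-scaled mel spectrogram can be computed from a waveform with NumPy alone: a windowed STFT, a triangular mel filterbank, and a decibel conversion. It is not the paper's implementation. The function names, the frame length (`n_fft=1024`), the hop size (`hop=256`), the number of mel bands (`n_mels=64`), and the synthetic 440 Hz test tone are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style mel formula (one common convention; an assumption here)
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose centers are evenly spaced on the mel scale
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr, n_fft=1024, hop=256, n_mels=64):
    # Short-time Fourier transform: Hann-windowed frames, then a real FFT per frame
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # (n_frames, n_fft//2 + 1)
    mel_power = mel_filterbank(sr, n_fft, n_mels) @ power.T   # (n_mels, n_frames)
    # Decibel conversion; the floor avoids log(0) on silent bins
    return 10.0 * np.log10(np.maximum(mel_power, 1e-10))

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
spec = log_mel_spectrogram(tone, sr)
print(spec.shape)  # (64, 59): 64 mel bands by 59 time frames
```

The resulting 2-D array can be treated as a single-channel image, which is the form of input a CNN classifier would consume. In practice an audio library would typically replace the hand-written filterbank.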

Paper Structure

This paper contains 19 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Spectrogram comparison
  • Figure 2: An example audio file from the dataset; only the sections marked in green (emotion and gender) need to be taken into consideration.
  • Figure 3: A log-scaled mel spectrogram obtained from an audio file from the dataset
  • Figure 4: Illustration of the importance of decibel conversion
  • Figure 5: Summary of the complete speech emotion recognition model
  • ...and 3 more figures