Improvement and Implementation of a Speech Emotion Recognition Model Based on Dual-Layer LSTM
Xiaoran Yang, Shuhan Yu, Wenxi Xu
TL;DR
The paper addresses improving accuracy and real-time performance in speech emotion recognition by augmenting a single-layer LSTM with an additional layer. It implements a dual-layer LSTM architecture with two 128-unit layers on MFCC features derived from the RAVDESS dataset, followed by a dense layer and Softmax classifier, trained with cross-entropy loss using the Adam optimizer. Results show a 2 percentage point improvement in accuracy over the single-layer baseline and reduced latency, achieving 87.33% accuracy on the eight-emotion subset of RAVDESS. The work has practical implications for intelligent customer service and human-computer interaction, and it discusses future directions such as weight pruning, quantization, and exploring Transformer-inspired hybrids for real-time SER systems.
Abstract
This paper builds upon an existing speech emotion recognition model by adding a second LSTM layer to improve the accuracy and processing efficiency of emotion recognition from audio data. By capturing the long-term dependencies within audio sequences through a dual-layer LSTM network, the model can recognize and classify complex emotional patterns more accurately. Experiments on the RAVDESS dataset validate this approach: the modified dual-layer LSTM model improves accuracy by 2 percentage points over the single-layer LSTM while significantly reducing recognition latency, thereby enhancing real-time performance. These results indicate that the dual-layer LSTM architecture is well suited to emotional features with long-term dependencies and offers a viable optimization for speech emotion recognition systems. This research provides a reference for practical applications in fields such as intelligent customer service, sentiment analysis, and human-computer interaction.
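To give a rough sense of the model size implied by the architecture in the TL;DR (two 128-unit LSTM layers followed by a dense Softmax classifier over eight emotions), the sketch below counts trainable parameters using the standard LSTM formulation with one bias vector per gate. The MFCC dimension of 40 and the choice of a single dense layer mapping directly to the eight classes are assumptions made for illustration; the summary above does not state either, and the helper names are hypothetical.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Trainable parameters of one LSTM layer.

    Four gates (input, forget, cell, output), each with input weights,
    recurrent weights, and one bias vector (standard formulation;
    some libraries keep two bias vectors per gate instead).
    """
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim: int, units: int) -> int:
    """Trainable parameters of a fully connected layer with bias."""
    return input_dim * units + units


N_MFCC = 40     # ASSUMPTION: MFCC coefficients per frame (not stated in the source)
UNITS = 128     # from the source: two 128-unit LSTM layers
N_CLASSES = 8   # from the source: eight emotions in RAVDESS

layer1 = lstm_params(N_MFCC, UNITS)     # first LSTM reads MFCC frames
layer2 = lstm_params(UNITS, UNITS)      # second LSTM reads layer-1 outputs
head = dense_params(UNITS, N_CLASSES)   # ASSUMPTION: dense layer maps straight to classes

single_layer_total = layer1 + head
dual_layer_total = layer1 + layer2 + head

print(f"LSTM layer 1: {layer1}")
print(f"LSTM layer 2: {layer2}")
print(f"Dense head:   {head}")
print(f"Single-layer model: {single_layer_total}")
print(f"Dual-layer model:   {dual_layer_total}")
```

Under these assumptions the second LSTM layer accounts for most of the parameters, which is one reason the weight pruning and quantization mentioned as future work are relevant for real-time deployment.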
