Emotional Vietnamese Speech-Based Depression Diagnosis Using Dynamic Attention Mechanism
Quang-Anh N. D., Manh-Hung Ha, Thai Kim Dinh, Minh-Duc Pham, Ninh Nguyen Van
TL;DR
This paper addresses depression detection from Vietnamese speech by using a Dynamic-CBAM within an Attention-GRU framework to extract emotion-related cues from audio. The Dynamic-CBAM integrates Omni-Dimensional Dynamic Convolution (ODConv) and is applied in a dual-stream architecture that processes raw waveform and MFCC features. On the VNEMOS dataset, the proposed approach reaches an Unweighted Accuracy (UA) of 0.87, a Weighted Accuracy (WA) of 0.86, and an F1 score of 0.87, outperforming several baselines and indicating that dynamic attention mechanisms are effective for speech emotion recognition. The work offers a practical path toward early depression screening using accessible acoustic data, with MFCC-focused processing providing potential computational advantages and flexibility across languages.
Abstract
Major depressive disorder is a prevalent and serious mental health condition that negatively affects a person's emotions, thoughts, actions, and overall perception of the world. Determining whether a person is depressed is difficult because the symptoms of depression are often not apparent. However, the voice is one of the factors from which signs of depression can be recognized. People who are depressed express discomfort and sadness, and they may speak slowly, with a trembling voice, and with little emotion. In this study, we propose the Dynamic Convolutional Block Attention Module (Dynamic-CBAM), used within an Attention-GRU network, to classify emotions by analyzing human audio signals. Based on the results, we can identify which patients are depressed or prone to depression, so that treatment and prevention can begin as soon as possible. The paper details the computational steps involved in implementing the Attention-GRU deep learning architecture. In experiments on the VNEMOS dataset, the model achieved an Unweighted Accuracy (UA) of 0.87, a Weighted Accuracy (WA) of 0.86, and an F1 score of 0.87. Training code is released at https://github.com/fiyud/Emotional-Vietnamese-Speech-Based-Depression-Diagnosis-Using-Dynamic-Attention-Mechanism
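To make the reported metrics concrete, the following is a minimal sketch of how UA, WA, and F1 are commonly computed in speech emotion recognition. It assumes the usual convention that WA is overall accuracy and UA is the mean of per-class recalls, and it assumes macro-averaged F1; the abstract does not state which definitions or averaging the authors used. The function names and the toy labels are hypothetical and are not taken from the released code or from VNEMOS.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced example with hypothetical emotion labels.
y_true = ["sad", "sad", "sad", "sad", "happy", "happy"]
y_pred = ["sad", "sad", "sad", "happy", "happy", "sad"]

print(f"WA = {weighted_accuracy(y_true, y_pred):.3f}")    # WA = 0.667
print(f"UA = {unweighted_accuracy(y_true, y_pred):.3f}")  # UA = 0.625
print(f"F1 = {macro_f1(y_true, y_pred):.3f}")             # F1 = 0.625
```

On imbalanced data the two accuracies diverge, as in the toy example, which is why emotion recognition results are often reported with both.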
