Investigating Acoustic-Textual Emotional Inconsistency Information for Automatic Depression Detection
Rongfeng Su, Changqing Xu, Xinyi Wu, Feng Xu, Xie Chen, Lan Wang, Nan Yan
TL;DR
This work tackles automatic depression detection from speech by exploiting emotional inconsistency between the acoustic and textual modalities, grounded in the Emotion Context-Insensitivity theory. It introduces Acoustic-Textual Emotional Inconsistency (ATEI) information, extracted with a multimodal cross-attention Transformer, and a learnable scaling factor that controls how strongly ATEI is weighted when it is fused with SSL-based acoustic and textual features from counseling conversations. The study shows that embedding-based ATEI representations, particularly when concatenated with the acoustic and textual features and combined with scaling, yield substantial gains over state-of-the-art baselines, reaching roughly 81% subject-level accuracy and better separation of depression severity levels. These results underscore the value of cross-modal emotional inconsistency as a diagnostic signal and open avenues for severity-aware, data-efficient depression detection in practical settings.
Abstract
Previous studies have demonstrated that emotional features from a single acoustic sentiment label can enhance depression diagnosis accuracy. Additionally, according to the Emotion Context-Insensitivity theory and our pilot study, individuals with depression might convey negative emotional content in an unexpectedly calm manner, showing a high degree of inconsistency in emotional expression during natural conversations. So far, few studies have recognized and leveraged this inconsistency in emotional expression for depression detection. In this paper, a multimodal cross-attention method is presented to capture Acoustic-Textual Emotional Inconsistency (ATEI) information. This is achieved by analyzing the intricate local and long-term dependencies of emotional expressions across the acoustic and textual domains, as well as the mismatch between the emotional content in the two domains. A Transformer-based model is then proposed to integrate this ATEI information using various fusion strategies for detecting depression. Furthermore, a scaling technique is employed to adjust the contribution of the ATEI features during fusion, thereby enhancing the model's ability to discern patients with depression across varying levels of severity. To the best of our knowledge, this work is the first to incorporate emotional expression inconsistency information into depression detection. Experimental results on a counseling conversational dataset illustrate the effectiveness of our method.
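As a rough illustration of the two mechanisms named in the abstract (cross-attention between modalities, and a scaling factor applied to the ATEI features before fusion), the NumPy sketch below may help. It is not the authors' model. The function names `cross_attention` and `fuse_with_scaling`, the parameter `alpha`, and in particular the inconsistency proxy (the difference between the two mean-pooled cross-attended views) are assumptions made for illustration only. A real system would use learned projections inside a Transformer and would learn the scaling factor rather than fix it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Scaled dot-product attention without learned projections.

    query:   (Tq, d) sequence from one modality
    context: (Tc, d) sequence from the other modality
    returns: (Tq, d) context summarised for each query position
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (Tq, Tc)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ context

def fuse_with_scaling(acoustic, textual, alpha):
    """Concatenate pooled acoustic, textual, and scaled ATEI-like features.

    The inconsistency proxy below is hypothetical: it compares what the
    acoustic stream retrieves from the text with what the text retrieves
    from the acoustic stream. `alpha` stands in for a learnable scalar.
    """
    a2t = cross_attention(acoustic, textual)  # acoustic attends to text
    t2a = cross_attention(textual, acoustic)  # text attends to acoustic
    atei = a2t.mean(axis=0) - t2a.mean(axis=0)
    return np.concatenate(
        [acoustic.mean(axis=0), textual.mean(axis=0), alpha * atei]
    )
```

For example, with acoustic features of shape `(5, 8)` and textual features of shape `(3, 8)`, `fuse_with_scaling` returns a vector of length 24 whose last 8 entries are the scaled inconsistency proxy. Setting `alpha` to 0 removes the ATEI contribution entirely, which is one way to see what the scaling factor controls.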
