Investigating Acoustic-Textual Emotional Inconsistency Information for Automatic Depression Detection

Rongfeng Su, Changqing Xu, Xinyi Wu, Feng Xu, Xie Chen, Lan Wang, Nan Yan

TL;DR

This work tackles automatic depression detection from speech by exploiting emotional inconsistency between the acoustic and textual modalities, grounded in Emotion Context-Insensitivity theory. It introduces Acoustic-Textual Emotional Inconsistency (ATEI) information, extracted with a multimodal cross-attention Transformer, and a learnable scaling step that controls how strongly ATEI is fused with SSL-based acoustic and textual features from counseling conversations. The study shows that embedding-based ATEI representations, particularly when concatenated with the acoustic and textual features and scaled, yield substantial gains over state-of-the-art baselines, reaching roughly 81% subject-level accuracy and better separation of depression severity levels. These results point to cross-modal emotional inconsistency as a useful diagnostic signal and open avenues for severity-aware, data-efficient depression detection in practical settings.

Abstract

Previous studies have demonstrated that emotional features from a single acoustic sentiment label can enhance depression diagnosis accuracy. Additionally, according to the Emotion Context-Insensitivity theory and our pilot study, individuals with depression might convey negative emotional content in an unexpectedly calm manner, showing a high degree of inconsistency in emotional expressions during natural conversations. So far, few studies have recognized and leveraged this emotional expression inconsistency for depression detection. In this paper, a multimodal cross-attention method is presented to capture the Acoustic-Textual Emotional Inconsistency (ATEI) information. This is achieved by analyzing the intricate local and long-term dependencies of emotional expressions across the acoustic and textual domains, as well as the mismatch between the emotional content of the two domains. A Transformer-based model is then proposed to integrate this ATEI information with various fusion strategies for detecting depression. Furthermore, a scaling technique is employed to adjust the contribution of the ATEI feature during the fusion process, thereby enhancing the model's ability to discern patients with depression across varying levels of severity. To the best of our knowledge, this work is the first to incorporate emotional expression inconsistency information into depression detection. Experimental results on a counseling conversational dataset illustrate the effectiveness of our method.
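
The two ideas named in the abstract, cross-attention between modalities and a scaled fusion of the resulting feature, can be illustrated with a minimal sketch. This is not the authors' implementation: the single-head attention, the mean pooling, the toy inputs, and the scalar `alpha` (standing in for the scaling applied to the ATEI feature) are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key/value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def mean_pool(rows):
    """Average a sequence of vectors into a single vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fuse(e_a, e_t, e_e, alpha):
    """Concatenate acoustic, textual, and alpha-scaled cross-modal vectors."""
    return e_a + e_t + [alpha * x for x in e_e]


# Toy emotion-related sequences (assumed values): 3 acoustic frames, 2 text tokens.
acoustic_emotion = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
textual_emotion = [[0.5, 0.5], [1.0, -1.0]]

# Acoustic frames query the textual sequence; the pooled result stands in for e^(E).
attended = cross_attention(acoustic_emotion, textual_emotion, textual_emotion)
e_e = mean_pool(attended)
e_a = mean_pool(acoustic_emotion)
e_t = mean_pool(textual_emotion)

# alpha is fixed here; in a trained model it would be a learnable parameter.
fused = fuse(e_a, e_t, e_e, alpha=0.5)
print(len(fused))
```

In this sketch, setting `alpha` to 0 removes the cross-modal feature from the fused vector entirely, while larger values give it more weight relative to the acoustic and textual features.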

Paper Structure

This paper contains 19 sections, 16 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Proposed Transformer-based framework using additional Acoustic-Textual Emotional Inconsistency (ATEI) information for automatic depression detection. Both $\mathbf{X}^{(\text{A})}$ and $\mathbf{X}^{(\text{T})}$ contain the short-term universal information related to depression. $\boldsymbol{e}^{(\text{A})}$, $\boldsymbol{e}^{(\text{T})}$ and $\boldsymbol{e}^{\text{(E)}}$ represent the acoustic, textual and ATEI features related to depression over a long time range, respectively.
  • Figure 2: Transformer-based feature aggregation for extracting the acoustic and textual features related to depression.
  • Figure 3: Multimodal cross-attention method for extracting ATEI information.
  • Figure 4: t-SNE projection of the outputs from the final hidden layer of the depression detection systems in Table \ref{tab:scaled-ATEI}: (a) "A+T" depression detection baseline, (b) "A+T+E" system incorporating ATEI embedding features without scaling, (c) "A+T+E" system incorporating ATEI embedding features with scaling. The ATEI embedding features of (b) and (c) were derived from the middle fully connected layer (FC2) of Fig. \ref{fig:ATEI-cues-extraction}. Green, blue, and red points correspond to healthy controls, individuals with mild depression, and individuals with moderate depression, respectively. Best viewed in color.