
Validating Computational Markers of Depressive Behavior: Cross-Linguistic Speech-Based Depression Detection with Neurophysiological Validation

Fuxiang Tao, Dongwei Li, Shuning Tang, Xuri Ge, Wei Ma, Anna Esposito, Alessandro Vinciarelli

Abstract

Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends the Cross-Data Multilevel Attention (CDMA) framework, initially validated on Italian, to investigate these dimensions using a Chinese Mandarin dataset with Electroencephalography (EEG) recordings. We systematically fuse read speech with spontaneous speech across different emotional valences (positive, neutral, negative) to investigate whether emotional arousal is a more critical factor than valence polarity in enhancing detection performance in speech. Additionally, we establish the first neurophysiological validation of a speech-based depression model by correlating its predictions with neural oscillatory patterns during emotional face processing. Our results demonstrate strong cross-linguistic generalizability of the CDMA framework, achieving state-of-the-art performance (F1-score up to 89.6%) on the Chinese dataset, comparable to the previous Italian validation. Critically, emotionally valenced speech (both positive and negative) significantly outperformed neutral speech, and the comparable performance between positive and negative tasks supports the emotional arousal hypothesis. Most importantly, EEG analysis revealed significant correlations between the model's speech-derived depression estimates and neural oscillatory patterns (theta and alpha bands), demonstrating alignment with established neural markers of emotional dysregulation in depression. This alignment, combined with the model's cross-linguistic robustness, not only supports the CDMA framework as a universally applicable and neurobiologically validated strategy but also establishes a novel paradigm for the neurophysiological validation of computational mental health models.

Paper Structure

This paper contains 31 sections, 4 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: The figure illustrates the proposed framework. Purple arrow shows read speech processing; green, yellow, and red arrows show positive, neutral, and negative spontaneous speech processing, respectively. The symbol $\otimes$ indicates majority voting that combines classification results from all speech types to produce the final depression prediction.
  • Figure 2: Time-frequency representations for the fearful facial expression condition. (A-D) Frontal region (13 electrodes, including E23, E18, E16, E10, E3, E19, E11, E4, E20, E12, E5, E118): (A) time-frequency representations of HCs and MDDs; (B) topographical distribution of the difference in theta (3-7 Hz) and alpha (8-12 Hz) band power between HCs and MDDs during the 0-200 ms time window; (C) time course of theta and alpha power with significant differences ($p < 0.05$); (D) bar plots of mean theta and alpha power within 0-200 ms; ** denotes $p < 0.01$. (E-H) Parieto-occipital region (17 electrodes: E62, E60, E67, E72, E77, E85, E59, E66, E71, E76, E84, E91, E65, E70, E75, E83, E90): (E) time-frequency representations of HCs and MDDs; (F) topographical difference maps (HCs minus MDDs); (G) time course with significant time points indicated ($p < 0.05$); (H) bar plots of mean theta and alpha power within 0-200 ms. * denotes $p < 0.05$. Significant group differences are observed in both theta and alpha power in both regions.
  • Figure 3: Time-frequency representations for the sad facial expression condition. (A-D) Frontal region: (A) Time-frequency power fluctuations in HCs and MDDs following stimulus onset; (B) topographical distribution of the difference in theta and alpha power between HCs and MDDs during the 0-200 ms time window, with stronger power observed in HCs; (C) time course of theta and alpha power with shaded areas indicating standard error of the mean (SEM); significant intervals in alpha are marked ($p < 0.05$). * denotes $p < 0.05$; (D) bar plots of mean theta and alpha power within 0-200 ms, showing a significant group difference in alpha ($p < 0.05$), while theta remains non-significant. (E-H) Parieto-occipital region: (E) Time-frequency representations of theta and alpha activity in HCs and MDDs; (F) topographical distribution of group differences (HCs minus MDDs), with higher theta and alpha power in HCs; (G) time course of theta and alpha power with significant intervals indicated ($p < 0.05$); (H) bar plots of mean theta and alpha power within 0-200 ms, with both frequency bands showing significant group differences.
  • Figure 4: Time-frequency representations for the happy facial expression condition. (A-D) Frontal region: (A) Time-frequency representations in HCs and MDDs; (B) topographical distribution of the difference in theta and alpha band power between HCs and MDDs during the 0-200 ms time window, showing increased alpha power in HCs; (C) time course of theta and alpha power with shaded areas indicating standard error of the mean (SEM); no significant group differences were observed; (D) bar plots of mean theta and alpha power within 0-200 ms, confirming non-significant effects. (E-H) Parieto-occipital region: (E) Time-frequency representations of HCs and MDDs showing activity differences in the 400-600 ms window; (F) topographical distribution of group differences (HCs minus MDDs), showing higher theta and alpha power in MDDs; (G) time course of theta and alpha power with significant intervals in alpha indicated ($p < 0.05$). * denotes $p < 0.05$; (H) bar plots of mean theta and alpha power within 400-600 ms; alpha power shows a significant group difference ($p < 0.05$), while theta remains non-significant.
  • Figure 5: Spearman correlations between model-derived depression logits and time-frequency power. (A-C) In the frontal region (0-200 ms, fearful condition), alpha power was negatively correlated with logits from the negative, neutral, and positive speech conditions. (D-F) In the parieto-occipital region (0-200 ms, fearful face condition), alpha power showed significant negative associations with logits under the negative, neutral, and positive speech conditions. (G) Parieto-occipital theta power (0-200 ms, fearful condition) negatively correlated with neutral speech logits. (H) Under the sad face condition (0-200 ms), parieto-occipital alpha power negatively correlated with positive speech logits. (I) In the happy face condition (400-600 ms), parieto-occipital alpha power positively correlated with logits from the negative speech condition. Each dot represents one participant (orange: MDD; green: HC).
  • ...and 1 more figure
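The fusion step in Figure 1 (the $\otimes$ symbol) combines the per-speech-type classification results by majority voting. A minimal sketch of that step follows; the 0/1 label convention and the tie-breaking rule (first-seen label wins on a 2-2 split) are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-speech-type binary predictions (0 = healthy control,
    1 = depressed) into a single final label by majority voting.
    Note: Counter.most_common breaks ties by insertion order, so a 2-2
    split falls back to the first-seen label -- an assumed convention."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Hypothetical per-task predictions: read, positive, neutral, negative speech.
preds = {"read": 1, "positive": 1, "neutral": 0, "negative": 1}
final = majority_vote(list(preds.values()))
print(final)  # -> 1 (classified as depressed: three of four tasks agree)
```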
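Figure 5 reports Spearman correlations between the model's speech-derived depression logits and EEG band power across participants. A minimal sketch of the statistic itself (Pearson correlation of average ranks), with hypothetical per-participant values; the variable names and data are illustrative only:

```python
import numpy as np


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks,
    with tied values assigned their average rank."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(1, len(a) + 1)
        for v in np.unique(a):          # average ranks over ties
            mask = a == v
            r[mask] = r[mask].mean()
        return r

    rx = ranks(np.asarray(x, dtype=float))
    ry = ranks(np.asarray(y, dtype=float))
    return np.corrcoef(rx, ry)[0, 1]


# Hypothetical example: alpha power vs. depression logits per participant;
# a negative rho mirrors the direction reported for the fearful condition.
alpha_power = [2.1, 1.8, 1.5, 1.2, 0.9]
logits = [-0.4, 0.1, 0.3, 0.8, 1.2]
print(round(spearman_rho(alpha_power, logits), 3))  # -> -1.0 (monotone example)
```

In practice `scipy.stats.spearmanr` provides the same statistic together with a p-value, which is what significance claims like those in Figure 5 rely on.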