An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech

Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia

TL;DR

This work tackles two key barriers in speech-based depression detection: reliance on short, segment-labeled data and lack of interpretability. It introduces a speech-level Audio Spectrogram Transformer that operates on long speech and a frame-based interpretation pipeline using gradient-weighted attention to reveal sentence- and frame-level relevancy, subsequently mapping relevant frames to OpenSMILE features. Experiments on the D-Vlog dataset show that long-speech processing improves AUC to 0.772 compared with 0.714 for a segment-level baseline, and perturbation analyses confirm the importance of the identified relevant regions. The interpretability framework uncovers clinically meaningful acoustic cues, such as reduced loudness and F0, supporting more responsible and clinically applicable AI for depression screening.
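As a rough illustration of what "gradient-weighted attention" can mean, the sketch below weights each attention entry by the gradient of the class output with respect to it, keeps only positive contributions, and averages over heads and query positions to get one relevancy score per token. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `grad_weighted_relevancy`, the single-layer input shape, the positive clipping, and the min-max scaling are all assumptions chosen for illustration.

```python
def grad_weighted_relevancy(attn, grad):
    """Hypothetical helper: per-token relevancy from one attention layer.

    attn and grad are nested lists shaped [heads][queries][keys], where
    grad holds the gradient of the class output w.r.t. each attention score.
    """
    n_heads = len(attn)
    n_queries = len(attn[0])
    n_keys = len(attn[0][0])
    scores = [0.0] * n_keys
    for h in range(n_heads):
        for q in range(n_queries):
            for k in range(n_keys):
                # Gradient-weighted attention; negative products are dropped
                # (an assumption: only positive evidence is kept).
                scores[k] += max(attn[h][q][k] * grad[h][q][k], 0.0)
    # Average over heads and query positions.
    scores = [s / (n_heads * n_queries) for s in scores]
    # Min-max scale to [0, 1] so a fixed threshold is comparable across inputs.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * n_keys
    return [(s - lo) / (hi - lo) for s in scores]
```

Scaling the scores to a fixed range is one way to make a single cut-off usable across recordings; whether the paper normalizes in this way is not stated in this summary.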

Abstract

Speech-based depression detection tools could aid early screening. Here, we propose an interpretable speech foundation model approach to enhance the clinical applicability of such tools. We introduce a speech-level Audio Spectrogram Transformer (AST) to detect depression using long-duration speech instead of short segments, along with a novel interpretation method that reveals prediction-relevant acoustic features for clinician interpretation. Our experiments show the proposed model outperforms a segment-level AST, highlighting the impact of segment-level labelling noise and the advantage of leveraging longer speech duration for more reliable depression detection. Through interpretation, we observe our model identifies reduced loudness and F0 as relevant depression signals, aligning with documented clinical findings. This interpretability supports a responsible AI approach for speech-based depression detection, rendering such tools more clinically applicable.

Paper Structure (13 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Workflow of the frame-based attention interpretation method. For demonstration purposes, the long spectrogram represents a long speech interval of ten sentences given as input to the proposed model. The speech-level interpretation first identifies the five most relevant sentences, here those with indexes 2, 5, 7, 8, and 9. The sentence-level interpretation then identifies the relevant frames in each of these sentences, using a relevancy threshold of 0.3. Lastly, the waveform signals that temporally correspond to the relevant frames are identified and processed by OpenSMILE to extract the relevant acoustic features.
  • Figure 2: Violin plot of relevant acoustic feature value distributions (residualized for sex and standardized) for true positives (n = 60) and true negatives (n = 41). Relevant acoustic features were extracted from the waveform signals temporally corresponding to the spectrogram frames with relevancy scores higher than 0.3 from each sample's five most relevant sentences. For true negatives, gradients of the output for the "normal" class with respect to the attention scores were used to weight the attention maps. Statistical significance was assessed using the Mann-Whitney U test, with Bonferroni correction applied for multiple comparisons. Significance levels are denoted as: $ns$ for not significant; * for $p \leq 0.002$; ** for $p \leq 0.0004$; and *** for $p \leq 4 \times 10^{-5}$.
  • Figure 3: Perturbation test results for the proposed model. Accuracy was computed using a decision threshold of 0.527, reflecting the depression prevalence in the data. In the first test, sentence-level spectrograms were incrementally excluded in descending order of relevance, 10 at a time. In the second, only frames with a relevance score above 0.3 within these sentences were removed. The results of these two tests were benchmarked against random exclusions: one with random sentence spectrogram exclusions, and another with random frame exclusions (30% within relevant sentence spectrograms). Accuracy never drops below 47.18%, likely due to the model's bias towards positive predictions in the presence of a slight class imbalance.
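
The selection steps described in the Figure 1 caption (keep the five most relevant sentences, keep frames with relevancy above 0.3, then locate the matching waveform regions) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the hop and window sizes (160 and 400 samples, i.e. 10 ms and 25 ms at 16 kHz) are assumed values, not settings taken from the paper.

```python
def top_k_sentences(sentence_scores, k=5):
    """Indexes of the k highest-scoring sentences, in ascending index order."""
    ranked = sorted(range(len(sentence_scores)),
                    key=lambda i: sentence_scores[i], reverse=True)
    return sorted(ranked[:k])


def relevant_frames(frame_scores, threshold=0.3):
    """Indexes of frames whose relevancy is strictly above the threshold."""
    return [i for i, s in enumerate(frame_scores) if s > threshold]


def frames_to_sample_spans(frames, hop=160, win=400):
    """Map frame indexes to merged (start, end) sample ranges.

    hop and win are ASSUMED framing parameters in samples; overlapping
    or touching ranges are merged into one span.
    """
    spans = []
    for f in sorted(frames):
        start, end = f * hop, f * hop + win
        if spans and start <= spans[-1][1]:
            # Extend the previous span instead of opening a new one.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans


# Made-up scores for ten sentences and four frames.
sentence_scores = [0.1, 0.2, 0.9, 0.1, 0.3, 0.8, 0.0, 0.7, 0.6, 0.5]
selected = top_k_sentences(sentence_scores)
frames = relevant_frames([0.1, 0.35, 0.3, 0.9])
spans = frames_to_sample_spans(frames)
```

The resulting sample spans could then be cut from the waveform and passed to an acoustic feature extractor such as OpenSMILE; that step is omitted here because it depends on the extractor's configuration.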