An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech
Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia
TL;DR
This work tackles two key barriers in speech-based depression detection: reliance on short, segment-labeled data and lack of interpretability. It introduces a speech-level Audio Spectrogram Transformer (AST) that operates on long speech, together with a frame-based interpretation pipeline that uses gradient-weighted attention to reveal sentence- and frame-level relevancy and then maps the relevant frames to OpenSMILE features. Experiments on the D-Vlog dataset show that long-speech processing raises AUC to 0.772, compared with 0.714 for a segment-level baseline, and perturbation analyses confirm the importance of the identified relevant regions. The interpretability framework uncovers clinically meaningful acoustic cues, such as reduced loudness and lower F0, supporting more responsible and clinically applicable AI for depression screening.
Abstract
Speech-based depression detection tools could aid early screening. Here, we propose an interpretable speech foundation model approach to enhance the clinical applicability of such tools. We introduce a speech-level Audio Spectrogram Transformer (AST) that detects depression from long-duration speech instead of short segments, along with a novel interpretation method that reveals prediction-relevant acoustic features for clinicians to inspect. Our experiments show that the proposed model outperforms a segment-level AST, highlighting the impact of segment-level labelling noise and the advantage of leveraging longer speech duration for more reliable depression detection. Through interpretation, we observe that our model identifies reduced loudness and reduced F0 as relevant depression signals, in line with documented clinical findings. This interpretability supports a responsible AI approach for speech-based depression detection, rendering such tools more clinically applicable.
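
The TL;DR mentions gradient-weighted attention as the basis for frame-level relevancy. The exact propagation rule is not given in this summary, so the following is only a minimal sketch of one common formulation: in each layer, attention is multiplied by its gradient with respect to the prediction, negative values are clipped, heads are averaged, and the result is accumulated across layers. The function names, the array shapes, the assumption that tokens correspond to speech frames, and the assumption of a classification token at index 0 are all hypothetical and are not taken from the paper.

```python
import numpy as np


def gradient_weighted_relevance(attentions, gradients):
    """Accumulate gradient-weighted attention across layers.

    attentions, gradients: lists (one entry per layer) of arrays with shape
    (heads, tokens, tokens). The gradients are assumed to have been computed
    elsewhere, with respect to the model's depression logit.
    """
    n_tokens = attentions[0].shape[-1]
    relevance = np.eye(n_tokens)  # each token starts out relevant to itself
    for attn, grad in zip(attentions, gradients):
        # Keep only positive contributions, then average over heads.
        weighted = np.clip(grad * attn, 0.0, None).mean(axis=0)
        relevance = relevance + weighted @ relevance
    return relevance


def frame_relevance(relevance, cls_index=0):
    """Read per-frame scores from the classification-token row, normalised to sum to 1."""
    row = np.delete(relevance[cls_index], cls_index)
    total = row.sum()
    return row / total if total > 0 else row


# Toy input: one layer, one head, uniform attention over a classification
# token plus three frames, and unit gradients.
attn = [np.full((1, 4, 4), 0.25)]
grad = [np.ones((1, 4, 4))]
scores = frame_relevance(gradient_weighted_relevance(attn, grad))
print(scores)  # uniform attention gives each of the three frames a score of 1/3
```

In a pipeline of the kind summarised above, the highest-scoring frames would then be the ones mapped to acoustic descriptors such as loudness and F0, and a perturbation check would mask those frames and measure how much the prediction changes.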
