Probing mental health information in speech foundation models
Marc de Gennes, Adrien Lesage, Martin Denais, Xuan-Nga Cao, Simon Chang, Pierre Van Remoortere, Cyrille Dakhlia, Rachid Riad
TL;DR
This work probes how speech foundation models encode mental-health signals. It asks which pretext tasks and model layers transfer best to depression detection, and how audio context length and pooling affect performance on French and Italian datasets. The authors evaluate several encoders (wav2vec2, HuBERT, Whisper) and probe both the encoder representations and their temporal dynamics. They find that semantically rich, later-layer representations (especially Whisper's) perform best on spontaneous speech, while pooling and window-length choices need to be tuned to each dataset. The resulting systems reach competitive, and in some cases state-of-the-art, performance on depression detection (notably on Androids), and the study gives guidance on context length and pooling that depends on dataset characteristics and language. The authors highlight practical implications for cross-language, non-invasive mental-health screening from speech, note multilingual factors as a limitation, and propose fine-tuning as future work to improve generalization.
Abstract
Non-invasive methods for diagnosing mental health conditions, such as speech analysis, offer promising potential in modern medicine. Recent advances in machine learning, particularly speech foundation models, have shown significant promise in detecting mental health states by capturing diverse features. This study investigates which pretext tasks in these models transfer best to mental health detection and examines how different model layers encode features relevant to mental health conditions. We also probe the optimal length of audio segments and the best pooling strategies to improve detection accuracy. Using the Callyope-GP and Androids datasets, we evaluate the models' effectiveness across different languages and speech tasks, aiming to enhance the generalizability of speech-based mental health diagnostics. Our approach achieves state-of-the-art scores in depression detection on the Androids dataset.
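
To make the probing setup described above more concrete, here is a minimal sketch of how frame-level embeddings from one encoder layer might be cut into fixed-length windows and pooled into one vector per window before being passed to a simple classifier. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy data, and the choice of mean and max pooling are all hypothetical, and a real setup would take the per-layer frame embeddings from an encoder such as wav2vec2, HuBERT, or Whisper.

```python
from statistics import fmean

def split_into_windows(frames, window, hop):
    """Cut a list of frame embeddings into fixed-length windows.

    A trailing partial window is dropped (an assumption of this sketch).
    """
    return [frames[i:i + window] for i in range(0, len(frames) - window + 1, hop)]

def mean_pool(frames):
    """Average each embedding dimension over the frames of one window."""
    dim = len(frames[0])
    return [fmean(f[d] for f in frames) for d in range(dim)]

def max_pool(frames):
    """Take the maximum of each embedding dimension over one window."""
    dim = len(frames[0])
    return [max(f[d] for f in frames) for d in range(dim)]

def pooled_features(layer_outputs, layer, window, hop, pool):
    """Select one layer, window its frames, and pool each window to a vector."""
    frames = layer_outputs[layer]
    return [pool(w) for w in split_into_windows(frames, window, hop)]

# Toy stand-in for encoder outputs: 2 layers, 4 frames, 2 dimensions each.
toy_layer_outputs = {
    0: [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]],
    1: [[1.0, 1.0], [1.0, 3.0], [5.0, 1.0], [1.0, 1.0]],
}

# Vary the layer, window length, and pooling function to compare probe inputs.
mean_feats = pooled_features(toy_layer_outputs, layer=0, window=2, hop=2, pool=mean_pool)
max_feats = pooled_features(toy_layer_outputs, layer=1, window=2, hop=2, pool=max_pool)
```

In a probing experiment of this kind, each combination of layer, window length, and pooling function would yield a different feature set, and a lightweight classifier trained on each set indicates how accessible the target information is under that configuration.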
