Probing mental health information in speech foundation models

Marc de Gennes, Adrien Lesage, Martin Denais, Xuan-Nga Cao, Simon Chang, Pierre Van Remoortere, Cyrille Dakhlia, Rachid Riad

TL;DR

This work probes how speech foundation models encode mental-health signals: which pretext tasks and model layers transfer best to depression detection, and how audio context length and pooling affect performance on French and Italian datasets. By evaluating multiple encoders (wav2vec2, HuBERT, Whisper) and probing both their layer-wise representations and their temporal dynamics, the authors show that semantically rich, later-layer representations (especially Whisper's) excel on spontaneous speech, while pooling and window-length strategies must be chosen per dataset. The approach achieves competitive, and in some cases state-of-the-art, performance on depression detection (notably on Androids) and offers guidance on context length and pooling tailored to dataset characteristics and language. The study highlights practical implications for cross-language, non-invasive mental-health screening from speech, notes multilingual factors as a limitation, and proposes future fine-tuning to improve generalization.

Abstract

Non-invasive methods for diagnosing mental health conditions, such as speech analysis, offer promising potential in modern medicine. Recent advancements in machine learning, particularly speech foundation models, have shown significant promise in detecting mental health states by capturing diverse features. This study investigates which pretext tasks in these models best transfer to mental health detection and examines how different model layers encode features relevant to mental health conditions. We also probed the optimal length of audio segments and the best pooling strategies to improve detection accuracy. Using the Callyope-GP and Androids datasets, we evaluated the models' effectiveness across different languages and speech tasks, aiming to enhance the generalizability of speech-based mental health diagnostics. Our approach achieved SOTA scores in depression detection on the Androids dataset.

Paper Structure (13 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Performance of encoders across their layers. Each graph shows the mean $F_1$ score as a function of the layer index for different models. Graphs (a) and (b) use the Androids corpus, while (c) and (d) use the Callyope-GP corpus. Graphs (a) and (c) correspond to the spontaneous task, and (b) and (d) correspond to the elicited task.
  • Figure 2: Effect of window size and pooling on performance. Each graph shows the mean $F_1$ score as a function of the pooling parameter $\omega$, for window sizes ranging from 0.5s to 20s. Graphs (a) and (b) use the Androids corpus, while (c) and (d) use the Callyope-GP corpus. Graphs (a) and (c) correspond to the HuBERT-XL model, and (b) and (d) correspond to the Whisper-L model.
  • Figure 3: Effect of window size and pooling on a balanced version of the Callyope-GP corpus. The dataset is balanced by randomly undersampling the majority class. Parameters are the same as in Figure 2(d).
  • Figure 4: Probing the performance of models on a single window, for different window sizes. (a) On the Androids Corpus, models show high performance even with as little as 5 or 10 seconds of audio. (b) On the Callyope-GP corpus, performance is significantly reduced, highlighting the importance of pooling and sampling strategies. (c) Undersampling the Callyope-GP corpus significantly improves performance, with the best results obtained for longer audio window sizes.
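
The captions above refer to three data-handling steps: cutting a recording into fixed-length windows, pooling window-level scores into one recording-level score, and balancing a corpus by randomly undersampling the majority class. The sketch below illustrates each step in plain Python. It is not the authors' implementation: the function names are hypothetical, the windows are assumed to be non-overlapping, and simple mean pooling stands in for the paper's $\omega$-parameterized pooling, whose definition is not given in this summary.

```python
import random
from collections import defaultdict


def split_into_windows(samples, sample_rate, window_seconds):
    """Cut a 1-D sequence of audio samples into non-overlapping windows.

    Assumption: a trailing remainder shorter than one window is dropped.
    """
    size = int(round(window_seconds * sample_rate))
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def mean_pool(window_scores):
    """Average window-level scores into a single recording-level score.

    Assumption: plain averaging, used here only as a stand-in for the
    paper's pooling strategy.
    """
    if not window_scores:
        raise ValueError("no window scores to pool")
    return sum(window_scores) / len(window_scores)


def undersample_majority(items, labels, seed=0):
    """Randomly drop examples so every class is as large as the rarest one."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    balanced = []
    for label, group in by_label.items():
        for item in rng.sample(group, target):
            balanced.append((item, label))
    rng.shuffle(balanced)
    return balanced
```

For example, `split_into_windows(audio, 16000, 5.0)` would yield 5-second windows from a recording sampled at 16 kHz (an assumed rate). A probe would score each window, and `mean_pool` would then combine those scores. `undersample_majority(recordings, labels)` returns a list of `(recording, label)` pairs with equal class counts.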