Speech-based Clinical Depression Screening: An Empirical Study

Yangbin Chen, Chenyang Xu, Chunfeng Liang, Yanbao Tao, Chuan Shi

TL;DR

This work investigates speech-based depression screening across three interaction modes (psychiatric interviews, chatbot conversations, and text readings) using both acoustic and deep speech features. It collects data from 270 participants diagnosed by psychiatrists and validated with the Mini-International Neuropsychiatric Interview (MINI), then trains simple classifiers (MLP/SVM) on clip-level features and aggregates the clip-level predictions by majority vote, using $N$ clips of $T$ seconds each per participant. The key findings show that speech from chatbot conversations can match or exceed the performance of clinical interviews, while reading tasks perform worse; deep speech features significantly outperform traditional acoustic features, with the observed benefits amplified by larger $N$ and $T$ values. The results support scalable, non-invasive depression screening tools that maintain high accuracy across interaction modes, enabling broader access and standardization in screening and monitoring.

Abstract

This study investigates the utility of speech signals for AI-based depression screening across varied interaction scenarios, including psychiatric interviews, chatbot conversations, and text readings. Participants include depressed patients recruited from the outpatient clinics of Peking University Sixth Hospital and control group members from the community, all diagnosed by psychiatrists following standardized diagnostic protocols. We extracted acoustic and deep speech features from each participant's segmented recordings. Classifications were made using neural networks or SVMs, with aggregated clip outcomes determining final assessments. Our analysis across interaction scenarios, speech processing techniques, and feature types confirms speech as a crucial marker for depression screening. Specifically, human-computer interaction matches clinical interview efficacy, surpassing reading tasks. Segment duration and quantity significantly affect model performance, with deep speech features substantially outperforming traditional acoustic features.
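
As an illustration of how aggregated clip outcomes can determine a participant's final assessment, the following is a minimal sketch of majority voting over clip-level predictions. It is not the authors' implementation: the function name `majority_vote`, the label encoding (0 = control, 1 = depressed), and the tie-breaking rule are all assumptions made for this example, since the summary does not specify them.

```python
from collections import Counter


def majority_vote(clip_predictions, tie_label=1):
    """Aggregate clip-level binary predictions into one participant-level label.

    clip_predictions: iterable of 0 (control) or 1 (depressed), one per clip.
    tie_label: label returned when both classes receive the same number of
               votes (an assumption; the source does not state a tie rule).
    """
    counts = Counter(clip_predictions)
    if not counts:
        raise ValueError("need at least one clip prediction")
    if counts[0] == counts[1]:
        return tie_label
    return 1 if counts[1] > counts[0] else 0


# Hypothetical clip-level outputs for one participant (N = 5 clips).
participant_label = majority_vote([1, 0, 1, 1, 0])
```

Using an odd number of clips per participant avoids ties altogether, which sidesteps the need for a tie-breaking rule.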

Paper Structure (11 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Framework of this study, which consists of several stages: (1) Interaction scenarios -- psychiatric interviews, chatbot conversations, and text readings; (2) Audio segmentation -- segmenting participants' recordings into audio clips; (3) Feature extraction -- extracting acoustic or deep speech features; (4) Classifier -- classifying each clip with a simple MLP or SVM model; (5) Majority vote -- determining the final prediction for each participant by voting among their clip outcomes.
  • Figure 2: Performance comparison of using different audio features across three interaction scenarios.
  • Figure 3: Performance comparison of using different numbers and durations of audio clips per individual.
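
The audio segmentation stage in Figure 1, and the clip number and duration settings compared in Figure 3, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: the function name `segment_audio`, the use of non-overlapping clips, taking the first clips in order, and discarding the trailing remainder are all choices made for this example.

```python
def segment_audio(samples, sample_rate, clip_seconds, max_clips):
    """Split a waveform into at most max_clips clips of clip_seconds each.

    samples: sequence of audio samples for one participant.
    sample_rate: samples per second.
    clip_seconds: clip duration (T in the summary above).
    max_clips: maximum number of clips to keep (N in the summary above).

    Assumptions: clips do not overlap, are taken from the start of the
    recording, and a trailing remainder shorter than one clip is dropped.
    """
    if max_clips <= 0:
        raise ValueError("max_clips must be positive")
    clip_len = int(round(sample_rate * clip_seconds))
    if clip_len < 1:
        raise ValueError("clip must contain at least one sample")

    clips = []
    # Step through the recording one full clip at a time.
    for start in range(0, len(samples) - clip_len + 1, clip_len):
        if len(clips) == max_clips:
            break
        clips.append(samples[start:start + clip_len])
    return clips


# Toy example: 10 samples at 2 Hz, 2-second clips, up to 3 clips requested.
toy_clips = segment_audio(list(range(10)), sample_rate=2, clip_seconds=2, max_clips=3)
```

In the toy example only two full clips fit, so fewer clips than requested are returned; a real pipeline would need to decide how to handle participants whose recordings are too short for the chosen settings.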