Speech-based Clinical Depression Screening: An Empirical Study
Yangbin Chen, Chenyang Xu, Chunfeng Liang, Yanbao Tao, Chuan Shi
TL;DR
This work investigates speech-based depression screening across three interaction modes (psychiatric interviews, chatbot conversations, and text readings) using both acoustic and deep speech features. Data were collected from 270 participants who were diagnosed by psychiatrists, with diagnoses validated using the MINI. Each participant's speech is split into $N$ clips of $T$ seconds each; simple classifiers (MLP or SVM) label every clip, and a majority vote over the clip labels gives the participant-level result. The key findings are that speech from chatbot conversations can match or exceed the performance of clinical interviews, while reading tasks perform worse, and that deep speech features significantly outperform traditional acoustic features, with the benefit growing as $N$ and $T$ increase. The results support scalable, non-invasive depression screening tools that maintain high accuracy across interaction modes, enabling broader access and standardization in screening and monitoring.
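As a rough illustration of the majority-vote aggregation described above, the following minimal sketch turns clip-level labels into one participant-level label. The function name `aggregate_by_majority`, the label encoding (1 = depressed, 0 = control), and the tie-breaking rule are assumptions made for this example; the source does not specify them.

```python
from typing import Sequence


def aggregate_by_majority(clip_predictions: Sequence[int]) -> int:
    """Combine per-clip labels (1 = depressed, 0 = control) into one label.

    Hypothetical helper: the label encoding and the tie rule are assumptions.
    """
    if not clip_predictions:
        raise ValueError("need at least one clip prediction")
    positives = sum(1 for p in clip_predictions if p == 1)
    # Assumed tie rule: an even split counts as positive, which favors
    # sensitivity in a screening setting.
    return 1 if 2 * positives >= len(clip_predictions) else 0


# Example: N = 5 clips from one participant, three of them flagged.
participant_label = aggregate_by_majority([1, 0, 1, 1, 0])
```

Using an odd $N$ avoids ties altogether, so the assumed tie rule only matters when $N$ is even.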
Abstract
This study investigates the utility of speech signals for AI-based depression screening across varied interaction scenarios, including psychiatric interviews, chatbot conversations, and text readings. Participants include depressed patients recruited from the outpatient clinics of Peking University Sixth Hospital and control group members from the community, all diagnosed by psychiatrists following standardized diagnostic protocols. We extract acoustic and deep speech features from each participant's segmented recordings. Each clip is classified using neural networks or SVMs, and the aggregated clip outcomes determine the final assessment. Our analysis across interaction scenarios, speech processing techniques, and feature types confirms speech as a crucial marker for depression screening. Specifically, human-computer interaction matches clinical interviews in efficacy and surpasses reading tasks. Segment duration and quantity significantly affect model performance, and deep speech features substantially outperform traditional acoustic features.
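To make the roles of segment duration and quantity concrete, the sketch below computes sample-index boundaries for cutting one recording into fixed-length clips. It assumes consecutive, non-overlapping clips taken from the start of the recording and discards any trailing partial clip; the source does not state how clips are selected, and the name `clip_boundaries` and its parameters are hypothetical.

```python
from typing import List, Tuple


def clip_boundaries(
    num_samples: int,
    sample_rate: int,
    clip_seconds: float,
    max_clips: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for up to `max_clips` clips.

    Assumption: clips are consecutive and non-overlapping, starting at
    sample 0; a trailing partial clip is dropped.
    """
    clip_len = int(round(clip_seconds * sample_rate))
    if clip_len <= 0 or max_clips <= 0:
        raise ValueError("clip length and clip count must be positive")
    bounds: List[Tuple[int, int]] = []
    start = 0
    while start + clip_len <= num_samples and len(bounds) < max_clips:
        bounds.append((start, start + clip_len))
        start += clip_len
    return bounds


# Example: a 10-second recording at 16 kHz, cut into 3-second clips (at most 5).
bounds = clip_boundaries(num_samples=160_000, sample_rate=16_000,
                         clip_seconds=3.0, max_clips=5)
```

Here `clip_seconds` plays the role of segment duration and `max_clips` the role of segment quantity; features would then be extracted from each slice and classified per clip.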
