Predicting Individual Depression Symptoms from Acoustic Features During Speech

Sebastian Rodriguez, Sri Harsha Dumpala, Katerina Dikaios, Sheri Rempel, Rudolf Uher, Sageev Oore

TL;DR

This work makes a first step towards using the acoustic features of speech to predict individual items of the depression rating scale before obtaining the final depression prediction, using convolutional and recurrent neural networks.

Abstract

Current automatic depression detection systems provide predictions directly without relying on the individual symptoms/items of depression as denoted in the clinical depression rating scales. In contrast, clinicians assess each item in the depression rating scale in a clinical setting, thus implicitly providing a more detailed rationale for a depression diagnosis. In this work, we make a first step towards using the acoustic features of speech to predict individual items of the depression rating scale before obtaining the final depression prediction. For this, we use convolutional (CNN) and recurrent (long short-term memory (LSTM)) neural networks. We consider different approaches to learning the temporal context of speech. Further, we analyze two variants of voting schemes for individual item prediction and depression detection. We also include an animated visualization that shows an example of item prediction over time as the speech progresses.

Paper Structure (11 sections, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Outline of our approach, which exploits the temporal information in speech for depression assessment. (a) Output of a model trained on speech recordings provided as a time series of short segments (input speech is split into 13-second segments with an overlap of 1 second) for detecting suicidal thoughts, one of the symptoms of major depressive disorder. (b) The segment-level predictions obtained for each item are combined using either hard or soft voting. The individual item predictions are then combined to obtain the final depression prediction.
  • Figure 2: Each box represents a 2-D convolution followed by a ReLU activation function. All convolutions have a stride of (2, 2). ch: number of channels, k: kernel/filter size, p: padding.
  • Figure 3: Complete architecture of the Spectrogram CNN-LSTM model. The CNN corresponds to the convolutional component explained in Figure 2. The LSTM layer has a hidden state size of 64.
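
The segmentation and voting steps described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names (`segment_starts`, `hard_vote`, `soft_vote`) are hypothetical, and the voting rules follow the usual textbook definitions: hard voting takes a majority over per-segment labels, soft voting averages per-segment probabilities. The paper's exact variants may differ.

```python
from collections import Counter
from typing import List, Sequence


def segment_starts(total_s: float, win_s: float = 13.0,
                   overlap_s: float = 1.0) -> List[float]:
    """Start times (seconds) of fixed-length windows over a recording.

    Consecutive windows share `overlap_s` seconds, so the hop between
    starts is `win_s - overlap_s`. A trailing partial window is dropped
    (an assumption; the caption does not say how the tail is handled).
    """
    hop = win_s - overlap_s
    starts: List[float] = []
    t = 0.0
    while t + win_s <= total_s:
        starts.append(t)
        t += hop
    return starts


def _argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def hard_vote(segment_probs: Sequence[Sequence[float]]) -> int:
    """Majority vote over the per-segment predicted labels.

    On a tie, the label that appeared first among the segments wins.
    """
    labels = [_argmax(p) for p in segment_probs]
    return Counter(labels).most_common(1)[0][0]


def soft_vote(segment_probs: Sequence[Sequence[float]]) -> int:
    """Average the per-segment class probabilities, then take the argmax."""
    n_segments = len(segment_probs)
    n_classes = len(segment_probs[0])
    mean = [sum(p[c] for p in segment_probs) / n_segments
            for c in range(n_classes)]
    return _argmax(mean)


if __name__ == "__main__":
    # Hypothetical per-segment probabilities for one binary item
    # (class 0 = symptom absent, class 1 = symptom present).
    probs = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
    print(segment_starts(37.0))
    print(hard_vote(probs), soft_vote(probs))
```

The two schemes can disagree: in the example above, two segments lean weakly towards class 0 while one leans strongly towards class 1, so the majority label and the averaged probability point in opposite directions. The same voting functions could be applied again across item-level predictions to reach a final depression decision, although how the paper combines items is not specified in this summary.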