
ComFeAT: Combination of Neural and Spectral Features for Improved Depression Detection

Orchid Chetia Phukan, Sarthak Jain, Shubham Singh, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma

TL;DR

The paper tackles depression detection from speech and the drop in real-world performance of neural features from pre-trained models due to domain variability. It proposes ComFeAT, a CNN-based system that fuses neural features (TRILLsson, x-vector) with spectral features (MFCC, LFCC) to improve robustness and accuracy, evaluated on the E-DAIC dataset. The key finding is that while neural features alone perform well, their combination with spectral features yields the best MAE and RMSE, achieving state-of-the-art RMSE on the benchmark. The work also presents a practical deployment with a React/Flask-based UI, enabling end-to-end processing from audio upload to depression intensity prediction, highlighting potential real-world impact in mental health support contexts.
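
Since the comparison above rests on MAE and RMSE, a minimal sketch of how the two metrics are computed may help. The score values below are hypothetical illustrations, not results from the paper, and the function names `mae` and `rmse` are chosen here for clarity.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: the average of |prediction - target|."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(y_true))

# Hypothetical depression-severity targets and model predictions
# (illustrative only; not taken from E-DAIC or from the paper).
targets = [4.0, 10.0, 15.0, 7.0]
predictions = [6.0, 9.0, 11.0, 7.0]

print(f"MAE:  {mae(targets, predictions):.3f}")
print(f"RMSE: {rmse(targets, predictions):.3f}")
```

RMSE squares each error before averaging, so it penalizes a few large misses more heavily than MAE does. This is one reason a system can rank differently on the two metrics.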

Abstract

In this work, we focus on detecting depression through speech analysis. Previous research has widely explored features extracted from pre-trained models (PTMs) primarily trained for paralinguistic tasks. Although these features have led to considerable advances in speech-based depression detection, their performance declines in real-world settings. To address this, we introduce ComFeAT, an application that employs a CNN model trained on a combination of features extracted from PTMs (neural features) and spectral features to enhance depression detection. Spectral features are robust to domain variations but do not perform as well as neural features. Surprisingly, combining the two shows complementary behavior and improves over both neural and spectral features individually. The proposed method also improves over previous state-of-the-art (SOTA) work on the E-DAIC benchmark.
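
The idea of combining neural and spectral features can be illustrated with a small sketch. The summary does not specify how ComFeAT fuses the two feature types inside its CNN, so the code below assumes the simplest option: standardize each feature vector, then concatenate them. The function names, vector sizes, and values are all hypothetical.

```python
from statistics import mean, pstdev

def standardize(vec):
    """Rescale a vector to zero mean and unit variance so that features
    measured on different scales become comparable."""
    mu = mean(vec)
    sigma = pstdev(vec)
    if sigma == 0:
        # A constant vector carries no variation; map it to zeros.
        return [0.0 for _ in vec]
    return [(v - mu) / sigma for v in vec]

def fuse(neural, spectral):
    """Assumed fusion rule: concatenate the standardized vectors.
    This is an illustration, not the paper's confirmed architecture."""
    return standardize(neural) + standardize(spectral)

# Hypothetical stand-ins: a short embedding from a pre-trained model and
# a short vector of spectral coefficients (real ones are much longer).
neural_embedding = [0.2, -0.1, 0.4, 0.3]
spectral_vector = [12.0, -3.5, 1.0]

fused = fuse(neural_embedding, spectral_vector)
print(len(fused))  # combined length of the two inputs
```

A downstream model would take the fused vector as input. Standardizing first keeps the larger-magnitude spectral values from dominating the embedding purely because of scale.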

Paper Structure

This paper contains 4 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Proposed Model Architecture
  • Figure 2: Flow Diagram