Integrating Large Language Models into a Tri-Modal Architecture for Automated Depression Classification on the DAIC-WOZ

Santosh V. Patapati

TL;DR

This work presents a novel, BiLSTM-based tri-modal model-level fusion architecture for the binary classification of depression from clinical interview recordings. It surpasses all baseline models and multiple state-of-the-art models on the DAIC-WOZ AVEC 2016 Challenge cross-validation split and the Leave-One-Subject-Out cross-validation split.

Abstract

Major Depressive Disorder (MDD) is a pervasive mental health condition that affects 300 million people worldwide. This work presents a novel, BiLSTM-based tri-modal model-level fusion architecture for the binary classification of depression from clinical interview recordings. The proposed architecture combines Mel Frequency Cepstral Coefficients (audio), Facial Action Units (video), and a GPT-4 model prompted with two-shot learning to process text data. This is the first work to incorporate large language models into a multi-modal architecture for this task. On the DAIC-WOZ AVEC 2016 Challenge cross-validation split and the Leave-One-Subject-Out cross-validation split, it surpasses all baseline models and multiple state-of-the-art models. In Leave-One-Subject-Out testing, it achieves an accuracy of 91.01%, an F1-Score of 85.95%, a precision of 80%, and a recall of 92.86%.
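
The reported Leave-One-Subject-Out metrics can be cross-checked against one another, since the F1-Score is the harmonic mean of precision and recall. The following is a minimal sketch of that arithmetic only; the function and variable names are illustrative and do not come from the paper, and the values are the precision and recall quoted in the abstract.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both precision and recall are 0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract (Leave-One-Subject-Out testing).
reported_precision = 0.80
reported_recall = 0.9286

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")
```

Under these inputs the result agrees with the reported F1-Score of 85.95% to within rounding.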

Paper Structure (27 sections, 10 figures, 7 tables)

Introduction
Related Works
Mel Frequency Cepstral Coefficients in Audio-Based Models
Facial Action Units in Video-Based Models
Multi-Modal Models
Data Fusion Strategies
Data Collection and Preprocessing
DAIC-WOZ
Dataset Errors
Text Preprocessing
Audio Preprocessing
Audio Segmentation
Audio Feature Extraction
MFCC Normalization
Visual Preprocessing
...and 12 more sections

Figures (10)

Figure 1: Example of Facial Action Unit descriptors in famous photos Tu2019. They can code nearly any anatomically possible facial expression.
Figure 2: Visualization of Early Fusion (left), Late Fusion (middle), and Model-Level Fusion (right) systems Borges2022.
Figure 3: Visualization of the pre-processing steps applied to DAIC-WOZ data.
Figure 4: Visualization of the process used to derive MFCCs. The derivation of MFCCs in this work is based on the implementation provided by the Slaney auditory toolbox Slaney1998.
Figure 5: Proposed model architecture following Hyperband tuning.
...and 5 more figures
