Depression Detection and Analysis using Large Language Models on Textual and Audio-Visual Modalities

Chayan Tank, Sarthak Pol, Vinayak Katoch, Shaina Mehta, Avinash Anand, Rajiv Ratn Shah

TL;DR

This work investigates depression detection and severity analysis using multi-modal data (text, audio, and visual) by integrating state-of-the-art large language models. Textual analysis with GPT-4 and related LLMs achieves state-of-the-art regression performance on PHQ-8 prediction from transcripts derived via Whisper, while audio-visual fusion offers competitive results. The authors contribute a detailed multi-modal methodology, a comprehensive evaluation on the E-DAIC/AVEC-2019 setting, and evidence that textual LLM-based approaches can surpass traditional multi-modal architectures in this domain. They also reveal challenges with data size and limitations in audio-visual alignments, pointing toward future work in robust multimodal LLM fusion and larger annotated datasets for clinically relevant depression assessment.

Abstract

Depression has proven to be a significant public health issue, profoundly affecting the psychological well-being of individuals. If it remains undiagnosed, depression can lead to severe health issues, which can manifest physically and even lead to suicide. Diagnosing depression or any other mental disorder generally involves semi-structured interviews conducted by clinicians and mental health professionals, alongside supplementary questionnaires such as variants of the Patient Health Questionnaire (PHQ). This approach relies heavily on the experience and judgment of trained physicians, making the diagnosis susceptible to personal biases. Given that the underlying mechanisms causing depression are still being actively researched, physicians often face challenges in diagnosing and treating the condition, particularly in its early stages of clinical presentation. Recently, significant strides have been made in artificial neural computing to solve problems involving text, image, and speech in various domains. Our analysis leverages these state-of-the-art (SOTA) models across multiple modalities. The experiments were performed on the Extended Distress Analysis Interview Corpus Wizard of Oz (E-DAIC) dataset presented in the Audio/Visual Emotion Challenge (AVEC) 2019. The proposed solutions, based on proprietary and open-source Large Language Models (LLMs), achieved a Root Mean Square Error (RMSE) of 3.98 on the textual modality, beating the AVEC 2019 challenge baseline results and current SOTA regression analysis architectures. Additionally, the proposed solution achieved an accuracy of 71.43% in the classification task. The paper also includes a novel audio-visual multi-modal network that predicts PHQ-8 scores with an RMSE of 6.51.
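The abstract reports two kinds of metric: RMSE for PHQ-8 score regression and accuracy for binary classification. The sketch below shows how such numbers are typically computed. It is illustrative only: the function names, the toy scores, and the cutoff of 10 for binarizing PHQ-8 scores are assumptions, not details taken from the paper.

```python
import math

PHQ8_MAX = 24      # PHQ-8 has eight items, each scored 0-3
PHQ8_CUTOFF = 10   # assumed binary threshold: score >= 10 counts as depressed


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length score lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def to_binary(score, cutoff=PHQ8_CUTOFF):
    """Map a PHQ-8 score to 1 (depressed) or 0 (non-depressed)."""
    return int(score >= cutoff)


def accuracy(labels_true, labels_pred):
    """Fraction of predicted labels that match the true labels."""
    if len(labels_true) != len(labels_pred) or not labels_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(t == p for t, p in zip(labels_true, labels_pred))
    return hits / len(labels_true)


# Toy data for illustration only; these are not results from the paper.
true_scores = [3, 12, 7, 20]
pred_scores = [5, 10, 7, 16]

score_rmse = rmse(true_scores, pred_scores)
class_acc = accuracy(
    [to_binary(s) for s in true_scores],
    [to_binary(s) for s in pred_scores],
)
```

Note that a model can misjudge severity by several points (raising RMSE) and still land on the correct side of the cutoff, so the two metrics need not move together.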

Paper Structure

This paper contains 24 sections, 3 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: The proposed textual network and audio-visual network, which predict patients' PHQ-8 scores from audio, visual, and textual cues. In the textual network, Whisper extracts transcripts from the audio, which are passed to LLMs along with prompts for PHQ-8 score and class prediction. The audio-visual network uses a Whisper + BiLSTM-based network that outputs the predicted PHQ-8 scores.
  • Figure 2: Number of Depressed & Non-Depressed Participants according to PHQ-8 Score of Train Set
  • Figure 3: Distribution of Participants based on PHQ-8 Scores of Train Set
  • Figure 4: Gender Distribution Based on PHQ-8 Binary Scores of Train Set
  • Figure 5: Gender Distribution of Train Set
  • ...and 4 more figures
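
The textual network described in the Figure 1 caption passes a transcript and a prompt to an LLM and reads back a PHQ-8 score. A minimal sketch of that prompt-and-parse step follows. Everything in it is hypothetical: the prompt wording, the `PHQ-8: <score>` reply format, and the `llm` callable are placeholders rather than the authors' prompts or any specific model API, and the Whisper transcription step is assumed to have already produced the transcript string.

```python
import re


def build_prompt(transcript: str) -> str:
    """Assemble a hypothetical scoring prompt around an interview transcript."""
    return (
        "You are assisting with research on depression screening.\n"
        "Read the interview transcript and estimate the participant's "
        "PHQ-8 score as a single integer from 0 to 24.\n"
        "Answer in the form 'PHQ-8: <score>'.\n\n"
        f"Transcript:\n{transcript.strip()}"
    )


def parse_score(reply: str):
    """Extract a valid PHQ-8 score from the reply, or None if absent or out of range."""
    match = re.search(r"PHQ-8:\s*(\d{1,2})\b", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 24 else None


def predict_phq8(transcript: str, llm):
    """Run one transcript through `llm`, any callable mapping a prompt string to a reply string."""
    return parse_score(llm(build_prompt(transcript)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "PHQ-8: 7"


predicted = predict_phq8("I have been sleeping badly lately.", fake_llm)
```

Validating the parsed value against the 0 to 24 range keeps malformed or off-scale replies out of downstream RMSE calculations; how the authors handle such replies is not stated in this excerpt.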