We Care: Multimodal Depression Detection and Knowledge Infused Mental Health Therapeutic Response Generation

Palash Moon, Pushpak Bhattacharyya

TL;DR

The paper tackles depression detection in real-life settings with a multimodal approach and real-world data, addressing the gap left by laboratory-only datasets. It introduces the Extended D-vlog dataset, uses the TVLT multimodal transformer for detection, and employs wav2vec2+spectrogram audio features with ViT video features and BERT text embeddings. For therapeutic support, it combines ABC-based distortion detection with the Mistral-7B-Instruct model and a RAG-based external knowledge base to generate CBT-informed responses, scoring 70.1% on distortion assessment and 30.9% on distortion classification, with a BERTScore of 88.7% against ground-truth therapy prompts. The work demonstrates improved generalization to clinical settings and proposes a path toward safe, knowledge-grounded AI-assisted mental health support, while acknowledging dataset bias, generalization limits, and ethical considerations.

Abstract

The detection of depression through non-verbal cues has gained significant attention. Previous research predominantly centred on identifying depression within controlled laboratory environments, often under the supervision of psychologists or counsellors. Unfortunately, datasets generated in such controlled settings may struggle to account for individual behaviours in real-life situations. In response to this limitation, we present the Extended D-vlog dataset, a collection of 1,261 YouTube vlogs. Additionally, the emergence of large language models (LLMs) such as GPT-3.5 and GPT-4 has sparked interest in their potential to act like mental health professionals. Yet the readiness of these LLMs for real-life use is still a concern, as they can give wrong responses that harm users. We introduce a virtual agent serving as an initial contact for mental health patients, offering Cognitive Behavioral Therapy (CBT)-based responses. It comprises two core functions: 1. identifying depression in individuals, and 2. delivering CBT-based therapeutic responses. Our Mistral model achieved scores of 70.1% and 30.9% for distortion assessment and classification, respectively, along with a BERTScore of 88.7%. Moreover, the TVLT model on our multimodal Extended D-vlog dataset achieved an F1-score of 67.8%.

Paper Structure (28 sections, 1 equation, 2 figures, 11 tables)

Figures (2)

  • Figure 1: Distribution of the various types of depressive vlogs. MDD stands for Major Depressive Disorder; Bipolar Disorder is also called Manic Disorder.
  • Figure 2: The architecture leverages three modalities: video, audio, and text, where the text is extracted from the audio segment using the Whisper ASR model. All three modalities are preprocessed and passed to the model, which produces a fused representation of them. This fused representation is then passed to a feed-forward neural network with a sigmoid function to determine whether the individual exhibits signs of depression or is in a normal state.
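
The final step described in the Figure 2 caption (fused representation, then feed-forward network, then sigmoid) can be illustrated with a minimal sketch. This is not the authors' implementation: the fused representation is stood in for by a plain list of floats rather than a TVLT output, and the single hidden layer, the ReLU activation, the 0.5 decision threshold, the hand-picked weights, and the name `classify` are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear(weights, bias, x):
    """Dense layer: one output per weight row, plus that row's bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    """Element-wise ReLU (an assumed activation, not stated in the source)."""
    return [max(0.0, x) for x in v]


def classify(fused, w1, b1, w2, b2, threshold=0.5):
    """Map a fused multimodal vector to a probability and a label.

    `fused` stands in for the fused video/audio/text representation;
    the weights and the threshold are hypothetical.
    """
    hidden = relu(linear(w1, b1, fused))
    logit = linear([w2], [b2], hidden)[0]
    prob = sigmoid(logit)
    label = "Depression" if prob >= threshold else "Normal"
    return prob, label


# Toy weights chosen by hand so the behaviour is easy to follow.
W1 = [[1.0, 0.0], [0.0, 1.0]]
B1 = [0.0, 0.0]
W2 = [1.0, -1.0]
B2 = 0.0

prob_a, label_a = classify([2.0, 0.5], W1, B1, W2, B2)
prob_b, label_b = classify([0.5, 2.0], W1, B1, W2, B2)
print(label_a, round(prob_a, 3))
print(label_b, round(prob_b, 3))
```

In a trained system the weights would be learned jointly with the fusion model, and the threshold would be tuned on validation data; the sketch only shows how a sigmoid output becomes a binary Depression/Normal decision.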