Enhancing Depression Detection via Question-wise Modality Fusion

Aishik Mandal, Dana Atzil-Slonim, Thamar Solorio, Iryna Gurevych

TL;DR

The paper tackles automated depression severity assessment from multimodal data by introducing QuestMF, a framework that fuses modalities question-wise (per PHQ-8 item) and outputs per-question scores, accounting for both the varying contribution of each modality to each question and the ordinal nature of the labels. It integrates turn-based encoders for text, audio, and video, uses cross-attention fusion to create a per-question fused representation, and is trained with Imbalanced Ordinal Log-Loss (ImbOLL) to handle label imbalance. Empirically, QuestMF with ImbOLL performs comparably to state-of-the-art methods on the E-DAIC dataset while improving interpretability through per-question predictions. The method holds promise for clinician-guided, symptom-specific interventions and can be extended to other questionnaires and longitudinal clinical data.
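
To make the cross-attention fusion step concrete, the following is a minimal NumPy sketch of two-modality cross-attention in which one modality's turn encodings act as queries and the other's act as keys and values. It is an illustration under stated assumptions, not the authors' implementation: the function name `cross_attention`, the array sizes, the absence of learned projection matrices, and the mean-pool-and-concatenate fusion at the end are all hypothetical choices made for brevity.

```python
import numpy as np


def cross_attention(query_seq, context_seq):
    """Single-head scaled dot-product cross-attention without learned projections.

    query_seq:   (turns_q, d) encoding of the modality that supplies the queries.
    context_seq: (turns_c, d) encoding of the modality that supplies keys and values.
    Returns the attended (turns_q, d) array and the (turns_q, turns_c) weights.
    """
    d = query_seq.shape[-1]
    scores = query_seq @ context_seq.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over context turns
    return weights @ context_seq, weights


rng = np.random.default_rng(0)
text_enc = rng.normal(size=(6, 16))   # hypothetical: 6 turns, 16-dim text encodings
audio_enc = rng.normal(size=(6, 16))  # hypothetical: 6 turns, 16-dim audio encodings

# Text as query, audio as key and value.
text_attends_audio, w_text = cross_attention(text_enc, audio_enc)
# Audio as query, text as key and value.
audio_attends_text, w_audio = cross_attention(audio_enc, text_enc)

# Assumed fusion step: mean-pool each direction over turns, then concatenate.
fused = np.concatenate(
    [text_attends_audio.mean(axis=0), audio_attends_text.mean(axis=0)]
)
```

In a question-wise setup, a separate fused vector of this kind would feed a separate classification head for each questionnaire item, which is what lets each question weight the modalities differently.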

Abstract

Depression is a highly prevalent and disabling condition that incurs substantial personal and societal costs. Current depression diagnosis involves determining the depression severity of a person through self-reported questionnaires or interviews conducted by clinicians. This often leads to delayed treatment and involves substantial human resources. Thus, several works try to automate the process using multimodal data. However, they usually overlook two things: i) the variable contribution of each modality for each question in the questionnaire and ii) the use of ordinal classification for the task. This results in sub-optimal fusion and training methods. In this work, we propose a novel Question-wise Modality Fusion (QuestMF) framework trained with a novel Imbalanced Ordinal Log-Loss (ImbOLL) function to tackle these issues. The performance of our framework is comparable to the current state-of-the-art models on the E-DAIC dataset and enhances interpretability by predicting scores for each question. This will help clinicians identify an individual's symptoms, allowing them to customise their interventions accordingly. We also make the code for the QuestMF framework publicly available.

Paper Structure

This paper contains 27 sections, 5 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: Proposed QuestMF framework to predict depression severity score. Here, Qx denotes Question number x in the questionnaire. MLP denotes Multilayer Perceptron, which is used as the classification head. Each question is scored among classes $\{0,1,2,3\}$. These scores are then added to get the total score $\in \{0,1,2,\ldots,3n\}$, where $n$ is the number of questions.
  • Figure 2: Architecture of single modality encoder models. We use a turn-based architecture to encode multi-turn dialogue data.
  • Figure 3: Architecture of two-modality fused models. We use cross-attention layers for interaction among modalities $M1$ and $M2$. In cross-attention, X $\rightarrow$ Y denotes that the Y modality encoding is used as the query and the X modality encoding as the key and value.
  • Figure 4: Architecture of the three-modality fused model. In cross-attention, X $\rightarrow$ Y denotes that the Y modality encoding is used as the query and the X modality encoding as the key and value.
  • Figure 5: Validation Concordance Correlation Coefficient (CCC) for each question with different modality models. Here, T refers to Text, A refers to Audio, and V refers to Video. A plus sign between modalities denotes a fusion of them. The video model for question $8$ assigns the same score to every data point, so its CCC is not valid and is not shown in the graph.
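
Two details from the captions above can be sketched in code: the total score as the sum of per-question class predictions (Figure 1), and the reason a model that predicts one constant score has no meaningful CCC (Figure 5). The sketch below is illustrative only. The helper names `total_score` and `ccc` are hypothetical, the CCC follows the standard definition based on covariance, variances, and the squared difference of means, and returning `None` for constant inputs is an assumed convention chosen to mirror the caption's "not valid" rather than anything taken from the paper's code.

```python
import numpy as np


def total_score(question_scores):
    """Sum per-question class predictions (each in {0, 1, 2, 3}) into a total score."""
    scores = np.asarray(question_scores)
    if scores.min() < 0 or scores.max() > 3:
        raise ValueError("each question score must lie in {0, 1, 2, 3}")
    return int(scores.sum())


def ccc(y_true, y_pred):
    """Concordance correlation coefficient; None when either input is constant."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # A constant vector has zero variance, so the underlying correlation is undefined.
    if y_true.max() == y_true.min() or y_pred.max() == y_pred.min():
        return None
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    denom = y_true.var() + y_pred.var() + (mean_true - mean_pred) ** 2
    return float(2.0 * cov / denom)


# Eight hypothetical per-question predictions, as for a PHQ-8 style questionnaire.
print(total_score([1, 0, 2, 3, 0, 1, 1, 0]))

# Perfect agreement versus a model that gives every data point the same score.
print(ccc([0, 1, 2, 3], [0, 1, 2, 3]))
print(ccc([0, 1, 2, 3], [1, 1, 1, 1]))
```

With eight questions, the total lies in $\{0, 1, \ldots, 24\}$, matching the $3n$ upper bound in the Figure 1 caption for $n = 8$.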