A Depression Detection Method Based on Multi-Modal Feature Fusion Using Cross-Attention

Shengjie Li, Yinhao Xiao

TL;DR

This work targets early depression detection from social media by fusing lexical text features with behavioral statistics in a cross-attention-based multimodal framework. The proposed Multi-Modal Feature Fusion Network based on Cross-Attention (MFFNC) uses MacBERT to extract word embeddings and a Transformer module to capture task-specific context, then fuses the features via cross-attention before passing them to an MLP classifier. On the Weibo User Depression Detection Dataset (WU3D), the method achieves an accuracy of 0.9495 and an F1 score of 0.9469, outperforming several baselines and remaining robust under ablation. The approach demonstrates the value of cross-modal integration for nuanced emotion and behavior analysis and offers a scalable path toward real-time mental health monitoring across social platforms.

Abstract

Depression, a prevalent and serious mental health issue, affects approximately 3.8% of the global population. Despite the existence of effective treatments, over 75% of individuals in low- and middle-income countries remain untreated, partly because depression is difficult to diagnose accurately in its early stages. This paper introduces a novel method for detecting depression based on multi-modal feature fusion using cross-attention. The method employs MacBERT as a pre-trained model to extract lexical features from text and adds a Transformer module to refine task-specific contextual understanding, which improves the model's adaptability to the target task. Rather than simply concatenating multi-modal features, as previous approaches do, it integrates them through cross-attention, significantly improving depression detection accuracy and enabling a more comprehensive and precise analysis of user emotions and behaviors. On this basis, a Multi-Modal Feature Fusion Network based on Cross-Attention (MFFNC) is constructed, which performs strongly on the depression identification task. Experimental results show that the method achieves an accuracy of 0.9495 on the test dataset, a substantial improvement over existing approaches. The approach also offers a promising methodology for other social media platforms and other tasks involving multi-modal processing. Timely identification and intervention for individuals with depression are crucial for saving lives, which highlights the potential of technology to facilitate early intervention for mental health issues.
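
To make the contrast with plain concatenation concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which one modality supplies the queries and the other supplies the keys and values. It is not the authors' MFFNC implementation: the choice of which modality acts as the query, the toy feature values, and the omission of learned projection matrices are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows.

    Assumption: no learned projections are applied; a real model would first
    map each modality through trainable weight matrices.
    """
    d_k = len(keys[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, values), weights


# Toy inputs (made-up numbers): three token vectors from the text branch
# and a single vector of behavioral statistics.
text_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
behavior_features = [[0.5, 0.5]]

# Assumed direction: the behavioral vector queries the text tokens.
fused, weights = cross_attention(behavior_features, text_features, text_features)
print("attention weights:", weights)
print("fused representation:", fused)
```

In this sketch the fused vector is a weighted mix of the text token vectors, with weights that depend on how well each token matches the behavioral query. Concatenation, by contrast, would place the two feature vectors side by side without any interaction between them. In a full model, the fused representation would then be passed to a classifier such as the MLP mentioned above.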

Paper Structure (33 sections, 11 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: The data structure for each user in the WU3D comprises the following attributes: gender, profile information, birthday, the number of followers, the number of followings, and a collection of tweets. Each tweet within this collection includes details such as text content, posting time, presence of pictures, the number of likes received, the number of times it was forwarded, the number of comments, and an indicator specifying whether the tweet is an original post or a retweet.
  • Figure 2: The Framework of the Network
  • Figure 3: The Computation Process of Cross-attention
  • Figure 4: Case of patients with depression
  • Figure 5: Validation Accuracy Over Iterations
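
The per-user record described in the Figure 1 caption can be sketched as a pair of Python dataclasses. The attribute set follows the caption, but the field names, the types, and the use of strings for dates are assumptions, since the caption does not give the dataset's actual schema. The `original_post_ratio` helper is a hypothetical example of a behavioral statistic and is not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tweet:
    text: str
    posting_time: str   # timestamp format is not specified here; kept as a string
    has_pictures: bool
    num_likes: int
    num_forwards: int
    num_comments: int
    is_original: bool   # True for an original post, False for a retweet


@dataclass
class User:
    gender: str
    profile: str
    birthday: str
    num_followers: int
    num_followings: int
    tweets: List[Tweet] = field(default_factory=list)


def original_post_ratio(user: User) -> float:
    """Illustrative statistic only: share of a user's tweets that are original posts."""
    if not user.tweets:
        return 0.0
    return sum(t.is_original for t in user.tweets) / len(user.tweets)


# Made-up example record, for illustration only.
example = User(
    gender="unknown",
    profile="",
    birthday="",
    num_followers=10,
    num_followings=20,
    tweets=[
        Tweet("first post", "", False, 1, 0, 0, True),
        Tweet("shared post", "", True, 0, 2, 1, False),
    ],
)
print(original_post_ratio(example))
```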