LMVD: A Large-Scale Multimodal Vlog Dataset for Depression Detection in the Wild

Lang He, Kai Chen, Junnan Zhao, Yimeng Wang, Ercheng Pei, Haifeng Chen, Jiewei Jiang, Shiqing Zhang, Jie Zhang, Zhongmin Wang, Tao He, Prayag Tiwari

TL;DR

This work introduces LMVD, a large-scale, in-the-wild multimodal vlog dataset for depression detection, built from four platforms and featuring rich audio-visual cues. To harness these signals, the authors propose MDDformer, a cross-fusion transformer that learns complementary information from audio (VGGish) and video (FAUs, landmarks, eye gaze, head pose) features. Empirical results show that MDDformer outperforms a suite of baselines, reaching approximately 76.9% accuracy (with related metrics also reported), which supports the value of large, diverse, multimodal data for affective computing. The authors plan to release the dataset and code to catalyze research in robust, privacy-conscious depression detection in real-world settings.

Abstract

Depression can significantly impact many aspects of an individual's life, including personal and social functioning, academic and work performance, and overall quality of life. Many researchers in the field of affective computing are adopting deep learning to explore patterns relevant to depression detection. However, because of concerns about protecting subjects' privacy, data in this area remain scarce, which presents a challenge for the deep discriminative models used to detect depression. To address this obstacle, a large-scale multimodal vlog dataset (LMVD) for depression recognition in the wild is built. LMVD contains 1823 samples, totaling 214 hours, from 1475 participants, collected from four multimedia platforms (Sina Weibo, Bilibili, TikTok, and YouTube). A novel architecture termed MDDformer is proposed to learn the non-verbal behaviors of individuals. Extensive validations on the LMVD dataset demonstrate superior performance for depression detection. We anticipate that LMVD will be a valuable resource for the depression detection community. The data and code will be released at: https://github.com/helang818/LMVD/.

Paper Structure (19 sections, 13 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Illustration of the 68 key points for facial landmarks feature.
  • Figure 2: The MDDformer model comprises three main steps: (a) Multimodal Feature Extraction: For the audio cue, deep features are extracted by VGGish. For the visual cue, AUs, head pose, landmarks, and eye gaze are extracted by a TCN architecture. (b) Multimodal Feature Fusion: The cross fusion transformer (CFformer) block leverages cross fusion to learn the combined informative behaviors from the audiovisual cues. (c) Depression Classifier: Two fully connected layers and the softmax function are adopted for predicting depression.
  • Figure 3: Confusion matrix of the MDDformer and other baseline methods. Each row represents the true labels, and each column represents the predicted values. Element $(m, n)$ indicates the percentage of samples from class $m$ being classified as class $n$.
  • Figure 4: Visualisation of the multimodal features using 3D t-SNE. The red dots represent data from depressed subjects, while the gray dots represent data from healthy control subjects.
  • Figure 5: The grouped bar chart for the different baseline methods and MDDformer.
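
The Figure 3 caption uses a row-normalized convention: element $(m, n)$ is the percentage of samples from true class $m$ predicted as class $n$, so each row sums to 100. A minimal sketch of that computation is given below. The function name `row_normalized_confusion` and the toy labels are illustrative assumptions, not the authors' code or LMVD results.

```python
from collections import Counter


def row_normalized_confusion(y_true, y_pred, labels):
    """Return matrix[m][n]: the percentage of samples whose true label is
    labels[m] that were predicted as labels[n] (each non-empty row sums to 100)."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for true_label in labels:
        row_total = sum(counts[(true_label, p)] for p in labels)
        row = [
            100.0 * counts[(true_label, p)] / row_total if row_total else 0.0
            for p in labels
        ]
        matrix.append(row)
    return matrix


# Toy labels, made up for illustration only: 0 = healthy control, 1 = depressed.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

cm = row_normalized_confusion(y_true, y_pred, labels=[0, 1])
for true_label, row in zip([0, 1], cm):
    print(true_label, row)
```

With this convention, the diagonal entry of each row is the per-class recall expressed as a percentage, which makes the panels comparable even when the two classes differ in size.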