Towards Stable Cross-Domain Depression Recognition under Missing Modalities

Jiuyi Chen, Mingkui Tan, Haifeng Lu, Qiuna Xu, Zhihua Wang, Runhao Zeng, Xiping Hu

TL;DR

This work introduces SCD-MLLM, a unified cross-domain multimodal depression recognition framework built on a Large Language Model backbone. It tackles cross-dataset heterogeneity and missing modalities via the Multi-Source Data Input Adapter (MDIA) and the Modality-Aware Adaptive Fusion Module (MAFM), with a Multi-Cue Fusion Video Encoder to standardize visual cues. Cross-dataset experiments on CMDC, AVEC2014, DAIC-WOZ, DVlog, and EATD demonstrate superior cross-domain generalization, robustness to missing modalities, and competitive performance against state-of-the-art methods and commercial LLMs. The approach offers a practical, scalable solution for real-world depression screening across diverse data sources and modalities.

Abstract

Depression poses serious public health risks, including suicide, underscoring the urgency of timely and scalable screening. Multimodal automatic depression detection (ADD) offers a promising solution; however, widely studied audio- and video-based ADD methods lack a unified, generalizable framework for diverse depression recognition scenarios and show limited stability to missing modalities, which are common in real-world data. In this work, we propose a unified framework for Stable Cross-Domain Depression Recognition based on a Multimodal Large Language Model (SCD-MLLM). The framework supports the integration and processing of heterogeneous depression-related data collected from varied sources while maintaining stability in the presence of incomplete modality inputs. Specifically, SCD-MLLM introduces two key components: (i) a Multi-Source Data Input Adapter (MDIA), which employs a masking mechanism and task-specific prompts to transform heterogeneous depression-related inputs into uniform token sequences, addressing inconsistency across diverse data sources; (ii) a Modality-Aware Adaptive Fusion Module (MAFM), which adaptively integrates audio and visual features via a shared projection mechanism, enhancing resilience under missing-modality conditions. We conduct comprehensive experiments under multi-dataset joint training settings on five publicly available, heterogeneous depression datasets from diverse scenarios: CMDC, AVEC2014, DAIC-WOZ, DVlog, and EATD. Across both complete and partial modality settings, SCD-MLLM outperforms state-of-the-art (SOTA) models as well as leading commercial LLMs (Gemini and GPT), demonstrating superior cross-domain generalization, enhanced ability to capture multimodal cues of depression, and strong stability to missing-modality cases in real-world applications.

Paper Structure

This paper contains 36 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of the proposed SCD-MLLM framework. This work presents a unified cross-domain multimodal depression recognition framework with strong stability based on a large language model (LLM), termed SCD-MLLM. Unlike prior approaches, SCD-MLLM transforms heterogeneous depression-related inputs into uniform token sequences through a Multi-Source Data Input Adapter (MDIA), and employs a Modality-Aware Adaptive Fusion Module (MAFM) to adaptively integrate audio-visual cues, enabling stable inference under incomplete modality conditions.
  • Figure 2: The framework of the Multi-Cue Fusion Video Encoder (MFVE). This module captures the high-order interactions among heterogeneous facial dynamics through cross-modal and self-attention mechanisms, enabling semantically aligned and LLM-compatible video representations for depression understanding.
  • Figure 3: The framework of the Video-Audio Fusion Module (VAFM). This module captures complementary cues through cross-modal attention and fuses audio-visual features into a unified representation for downstream LLM-based depression analysis.
  • Figure 4: Different prompt designs for different data sources.
  • Figure 5: Comparison of different video extraction methods. All comparisons are conducted under the T+A+V modality configuration, with F1-score as the performance metric. The results demonstrate the superiority of our method over existing fusion strategies, highlighting its more effective integration of heterogeneous visual cues.
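
The abstract describes fusing audio and visual features through a shared projection while staying stable when a modality is missing. The sketch below illustrates that general idea only and is not the authors' MAFM or VAFM: it replaces the cross-modal attention named in the Figure 3 caption with a much simpler masked softmax over per-modality gate scores. The names `project`, `fuse`, and `gate_scores`, and the use of `None` to mark a missing modality, are all assumptions made for illustration.

```python
import math


def project(features, weights):
    """Apply one shared linear projection (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def fuse(audio, video, weights, gate_scores=(0.0, 0.0)):
    """Fuse whichever modalities are present; None marks a missing one.

    gate_scores are hypothetical per-modality logits; in a trained model
    they would be predicted from the inputs rather than fixed.
    """
    inputs = [audio, video]
    present = [x is not None for x in inputs]
    if not any(present):
        raise ValueError("at least one modality is required")

    # Masked softmax: a missing modality contributes zero weight, so the
    # remaining weights still sum to 1.
    exps = [math.exp(s) if p else 0.0 for s, p in zip(gate_scores, present)]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Both modalities pass through the same projection before mixing.
    fused = [0.0] * len(weights)
    for x, alpha in zip(inputs, alphas):
        if x is None:
            continue
        z = project(x, weights)
        fused = [f + alpha * v for f, v in zip(fused, z)]
    return fused, alphas


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # toy shared projection
    print(fuse([1.0, 2.0], [3.0, 4.0], identity))  # both modalities
    print(fuse([1.0, 2.0], None, identity))        # video missing
```

Because the mask is applied before normalization, dropping a modality shifts all of the weight onto the one that remains instead of mixing in a zero vector, which is one simple way to keep the fused representation on a comparable scale across complete and partial inputs.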