DepMamba: Progressive Fusion Mamba for Multimodal Depression Detection
Jiaxin Ye, Junping Zhang, Hongming Shan
TL;DR
DepMamba tackles two core challenges in multimodal depression detection: inefficient long-range temporal modeling and sub-optimal fusion across modalities. It introduces hierarchical contextual modeling (CNN for local, Bi-Mamba for global) and a two-stage progressive fusion framework (CoSSM and EnSSM) to jointly optimize intermodal and intramodal representations. Empirical results on D-Vlog and LMVD show DepMamba surpasses state-of-the-art baselines in accuracy and recall while offering superior efficiency, notably outperforming Transformer-based approaches in long-range sequence modeling. The work highlights the value of modality-cohesive, hierarchical fusion for robust audiovisual depression detection and points to future cross-architecture hybrids to further improve cross-domain generalization.
Abstract
Depression is a common mental disorder that affects millions of people worldwide. Although promising, current multimodal methods hinge on aligned or aggregated multimodal fusion and suffer from two significant limitations: (i) inefficient long-range temporal modeling, and (ii) sub-optimal multimodal fusion that poorly balances intermodal fusion with intramodal processing. In this paper, we propose an audio-visual progressive fusion Mamba for multimodal depression detection, termed DepMamba. DepMamba features two core designs: hierarchical contextual modeling and progressive multimodal fusion. On the one hand, hierarchical modeling combines convolutional neural networks and Mamba to extract local-to-global features within long-range sequences. On the other hand, progressive fusion first applies a multimodal collaborative State Space Model (SSM) that extracts intermodal and intramodal information for each modality, and then uses a multimodal enhanced SSM for modality cohesion. Extensive experimental results on two large-scale depression datasets, D-Vlog and LMVD, demonstrate the superior performance of DepMamba over existing state-of-the-art methods. Code is available at https://github.com/Jiaxin-Ye/DepMamba.
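
To make the data flow described in the abstract concrete, the following is a toy sketch in plain Python, not the authors' implementation (see the repository linked above for that). It assumes each modality is a non-empty, time-aligned sequence of scalars, and it uses a fixed-parameter scalar state space recurrence in place of Mamba's input-dependent (selective) SSM. All function names, the smoothing kernel, the parameter values, and the specific way the two fusion stages combine streams are illustrative assumptions.

```python
import math

def conv1d_same(seq, kernel):
    """Local context: 1-D convolution over time, zero-padded to keep length."""
    pad = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            i = t + k - pad
            if 0 <= i < len(seq):
                acc += w * seq[i]
        out.append(acc)
    return out

def ssm_scan(seq, a=0.9, b=0.1, c=1.0):
    """Scalar linear SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    A single pass over the sequence, so cost grows linearly with length.
    Assumption: a, b, c are fixed here; Mamba makes them input-dependent."""
    h = 0.0
    out = []
    for x in seq:
        h = a * h + b * x
        out.append(c * h)
    return out

def bi_ssm(seq, **params):
    """Global context: sum of a forward scan and a time-reversed backward scan."""
    fwd = ssm_scan(seq, **params)
    bwd = ssm_scan(seq[::-1], **params)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]

def progressive_fusion(audio, video):
    """Hypothetical two-stage fusion over two aligned scalar sequences."""
    smooth = [0.25, 0.5, 0.25]  # illustrative local kernel
    # Hierarchical contextual modeling: local conv, then global scan.
    a_ctx = bi_ssm(conv1d_same(audio, smooth))
    v_ctx = bi_ssm(conv1d_same(video, smooth))
    # Stage 1 (collaborative, assumed form): each modality keeps its own
    # intramodal scan and adds a shared intermodal scan of the averaged stream.
    shared = bi_ssm([(a + v) / 2 for a, v in zip(a_ctx, v_ctx)])
    a_co = [x + s for x, s in zip(bi_ssm(a_ctx), shared)]
    v_co = [x + s for x, s in zip(bi_ssm(v_ctx), shared)]
    # Stage 2 (enhanced, assumed form): one scan over the merged stream.
    fused = bi_ssm([x + y for x, y in zip(a_co, v_co)])
    # Mean-pool over time and squash to a score in (0, 1).
    pooled = sum(fused) / len(fused)
    return 1.0 / (1.0 + math.exp(-pooled))

audio = [0.2, 0.4, 0.1, 0.0, 0.3, 0.5]
video = [0.1, 0.0, 0.3, 0.2, 0.4, 0.1]
score = progressive_fusion(audio, video)
```

The point of the sketch is the ordering: local features feed a global bidirectional scan per modality, a first stage mixes intramodal and intermodal signals while keeping the modalities separate, and only the second stage merges them into a single representation. A real model would operate on multi-dimensional feature vectors with learned parameters.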
