CANAMRF: An Attention-Based Model for Multimodal Depression Detection
Yuntao Wei, Yuzhe Zhang, Shuyang Zhang, Hong Zhang
TL;DR
CANAMRF addresses multimodal depression detection by learning to weigh and fuse cues from four modalities: textual, visual, acoustic, and a novel sentiment structural modality introduced to supply additional information. Fusion is performed by an Adaptive Multimodal Recurrent Fusion (AMRF) module, and a Hybrid Attention Module applies cross-modal and self-attention to generate discriminative representations for depression prediction. On the CMDC and EATD-Corpus datasets, CANAMRF achieves state-of-the-art performance, outperforming strong baselines on F1 and related metrics. This adaptive, attention-driven fusion framework could offer a scalable path toward more accurate, real-time multimodal depression screening in practical settings.
Abstract
Multimodal depression detection is an important research topic that aims to predict human mental states using multimodal data. Previous methods treat all modalities equally and fuse them with naïve mathematical operations, without measuring their relative importance, and therefore fail to produce effective multimodal representations for downstream depression tasks. To address this limitation, we present a Cross-modal Attention Network with Adaptive Multi-modal Recurrent Fusion (CANAMRF) for multimodal depression detection. CANAMRF consists of a multimodal feature extractor, an Adaptive Multimodal Recurrent Fusion module, and a Hybrid Attention Module. In experiments on two benchmark datasets, CANAMRF achieves state-of-the-art performance, underscoring the effectiveness of the proposed approach.
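To make the contrast between equal-weight fusion and importance-aware fusion concrete, the following is a minimal sketch in plain Python. It is not the AMRF module described in the paper, whose internals are not given in this abstract. It only illustrates the general idea of scoring each modality and fusing with softmax-normalized weights. The function names, the `scoring_vector` parameter (a stand-in for learned parameters), and the toy feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def naive_fusion(features):
    """Element-wise mean: every modality contributes equally."""
    vectors = list(features.values())
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def weighted_fusion(features, scoring_vector):
    """Fuse modalities with softmax weights derived from a scoring vector.

    `scoring_vector` is a hypothetical stand-in for learned parameters:
    each modality's score is its dot product with this vector.
    Returns the fused vector and the per-modality weights.
    """
    names = list(features)
    scores = [
        sum(a * b for a, b in zip(features[name], scoring_vector))
        for name in names
    ]
    weights = softmax(scores)
    dim = len(scoring_vector)
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy, made-up feature vectors for three modalities (illustrative only).
example_features = {
    "text": [1.0, 0.0],
    "audio": [0.0, 1.0],
    "visual": [1.0, 1.0],
}
fused_vector, modality_weights = weighted_fusion(example_features, [2.0, 0.0])
print(modality_weights)
print(fused_vector)
```

With an all-zero scoring vector every modality receives the same weight, so `weighted_fusion` reduces to `naive_fusion`. A non-zero scoring vector shifts weight toward the modalities that score higher, which is the behavior the equal-treatment baselines lack.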
