CANAMRF: An Attention-Based Model for Multimodal Depression Detection

Yuntao Wei, Yuzhe Zhang, Shuyang Zhang, Hong Zhang

TL;DR

CANAMRF tackles multimodal depression detection by learning to weigh and fuse cues from textual, visual, acoustic, and a novel sentiment structural modality through an Adaptive Multimodal Recurrent Fusion (AMRF) framework and a Hybrid Attention Module. The approach introduces a sentiment structural modality to augment information and employs cross-modal and self-attention to generate discriminative representations for depression prediction. Empirical results on CMDC and EATD-Corpus demonstrate state-of-the-art performance, outperforming strong baselines across F1 and related metrics. This adaptive, attention-driven fusion framework offers a scalable path toward more accurate, real-time multimodal depression screening in practical settings.

Abstract

Multimodal depression detection is an important research topic that aims to predict human mental states using multimodal data. Previous methods treat different modalities equally and fuse them with naïve mathematical operations without measuring their relative importance, and therefore fail to produce effective multimodal representations for downstream depression tasks. To address this limitation, we present a Cross-modal Attention Network with Adaptive Multi-modal Recurrent Fusion (CANAMRF) for multimodal depression detection. CANAMRF consists of a multimodal feature extractor, an Adaptive Multimodal Recurrent Fusion module, and a Hybrid Attention Module. In experiments on two benchmark datasets, CANAMRF demonstrates state-of-the-art performance, underscoring the effectiveness of our proposed approach.

Paper Structure (10 sections, 7 equations, 2 figures, 1 table)


Figures (2)

  • Figure 1: The overall framework of CANAMRF. (a) Feature extraction procedure for multiple modalities; (b) Fusion of modalities through the AMRF module; (c) Hybrid Attention Mechanism.
  • Figure 2: Subfigure (a): the framework of the AMRF module. (a) Features from different modalities are projected into the same low-dimensional space by fully-connected layers; (b) The low-dimensional features are further processed by the Recur operation; (c) Features are fused according to the adaptive fusion mechanism and transformed via fully-connected layers to obtain the final representation. Subfigure (b): the framework of the Hybrid Attention Module.
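
The abstract contrasts equal-weight fusion with fusion that measures the relative importance of each modality. The sketch below illustrates that general idea only and is not the paper's AMRF module: it projects each modality into a shared low-dimensional space with a fully-connected layer, scores each projection, and fuses with softmax-normalized weights. The Recur operation is omitted because it is not defined in this section. The softmax scoring scheme, all dimensions, the random initialization, and the names `linear`, `init_layer`, and `adaptive_fuse` are assumptions made for illustration.

```python
import math
import random


def linear(x, weights, bias):
    """Fully-connected layer y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, out_dim, in_dim):
    """Hypothetical initializer: small uniform weights, zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def adaptive_fuse(features, projections, scorers):
    """Fuse modality vectors with learned-style importance weights.

    features:    modality name -> input vector
    projections: modality name -> (W, b) mapping into the shared space
    scorers:     modality name -> (w, b) producing one scalar score
    Returns the fused vector and the per-modality weights.
    """
    # Project every modality into the same low-dimensional space.
    projected = {name: linear(x, *projections[name])
                 for name, x in features.items()}
    names = sorted(projected)
    # One scalar importance score per modality (assumed scoring scheme).
    scores = [linear(projected[n], *scorers[n])[0] for n in names]
    alphas = softmax(scores)
    # Weighted sum of the projected vectors, element by element.
    dim = len(projected[names[0]])
    fused = [sum(a * projected[n][i] for a, n in zip(alphas, names))
             for i in range(dim)]
    return fused, dict(zip(names, alphas))


# Toy inputs: four modalities with different (made-up) feature sizes.
rng = random.Random(0)
input_dims = {"text": 8, "visual": 6, "acoustic": 5, "sentiment": 3}
shared_dim = 4
features = {n: [rng.gauss(0.0, 1.0) for _ in range(d)]
            for n, d in input_dims.items()}
projections = {n: init_layer(rng, shared_dim, d)
               for n, d in input_dims.items()}
scorers = {n: init_layer(rng, 1, shared_dim) for n in input_dims}

fused, weights = adaptive_fuse(features, projections, scorers)
print(len(fused), weights)
```

With untrained random parameters the weights come out close to uniform. In a trained model the scoring layers would be learned jointly with the classifier, so that more informative modalities receive larger weights.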