A Novel Audio-Visual Information Fusion System for Mental Disorders Detection

Yichun Li, Shuanglin Li, Syed Mohsen Naqvi

TL;DR

This paper focuses on the emotional expression features of mental disorders and introduces a multimodal mental disorder diagnosis system based on audio-visual information input. The system achieves over 80% accuracy on the real multimodal ADHD dataset and state-of-the-art results on the depression dataset AVEC 2014.

Abstract

Mental disorders are among the foremost contributors to the global healthcare challenge. Research indicates that timely diagnosis and intervention are vital in treating various mental disorders. However, the early somatization symptoms of certain mental disorders may not be immediately evident, so they are often overlooked or misdiagnosed. Additionally, traditional diagnostic methods are time-consuming and costly. Deep learning methods based on fMRI and EEG have improved the efficiency of the mental disorder detection process, but the cost of the equipment and trained staff is generally high. Moreover, most systems are trained for only one specific mental disorder and are not general-purpose. Recently, physiological studies have shown that a few mental disorders (e.g., depression and ADHD) present speech- and facial-related symptoms. In this paper, we focus on the emotional expression features of mental disorders and introduce a multimodal mental disorder diagnosis system based on audio-visual information input. Our proposed system is based on spatial-temporal attention networks and innovatively uses a less computationally intensive pre-trained audio recognition network to fine-tune the video recognition module for better results. We also apply the unified system to multiple mental disorders (ADHD and depression) for the first time. The proposed system achieves over 80% accuracy on the real multimodal ADHD dataset and achieves state-of-the-art results on the depression dataset AVEC 2014.

Paper Structure

This paper contains 10 sections, 3 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Age (years) and gender distribution of patients with a diagnosis of severe mental illness (SMI) compared with all patients recorded by the United Kingdom National Health Service (NHS) [UKNHS].
  • Figure 2: The two-dimensional emotion space, which can be divided into four quadrants. Each quadrant is associated with various emotions.
  • Figure 3: Illustration of the proposed mental disorders assessment and detection system. The spectral-temporal feature from the video is extracted with the Cov-Attention module, 1D-convolutional layers, and spatial-temporal networks. We use a pre-trained Attention-CNN module based on audio to fine-tune the video input. The $\oplus$ symbol denotes the concatenation operation for the fusion loss. The variables $\ell_{S}$, $\ell_{V}$, and $\ell_{B}$ denote the loss of the audio-based recognition model, the loss of the video-based recognition model, and the fusion background loss, respectively.
  • Figure 4: Illustration of the facial reactions closely associated with mental disorders, i.e., ADHD (a) and depression (b), shown for randomly selected subjects from the multimodal ADHD data and the AVEC 2014 data, respectively.
  • Figure 5: Illustration of the audio spectrogram from a randomly selected ADHD subject, depression subject, and neurotypical control.
  • ...and 1 more figure
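
The Figure 3 caption names three losses ($\ell_{S}$ for audio, $\ell_{V}$ for video, $\ell_{B}$ for fusion) and a concatenation step ($\oplus$). The sketch below is one plausible reading of that arrangement, not the authors' implementation. It assumes pre-extracted feature vectors, a two-class toy task, cross-entropy for every branch, hypothetical linear heads `w_s`, `w_v`, `w_b`, and an unweighted sum of the three losses. None of these choices is stated in the text above.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy; logits has shape (N, C), labels shape (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def branch_and_fusion_losses(audio_feat, video_feat, labels, w_s, w_v, w_b):
    """Return (l_S, l_V, l_B) for hypothetical linear heads w_s, w_v, w_b."""
    loss_s = cross_entropy(audio_feat @ w_s, labels)  # audio-only branch
    loss_v = cross_entropy(video_feat @ w_v, labels)  # video-only branch
    fused = np.concatenate([audio_feat, video_feat], axis=1)  # the concatenation step
    loss_b = cross_entropy(fused @ w_b, labels)  # fusion branch
    return loss_s, loss_v, loss_b


rng = np.random.default_rng(0)
n, d_audio, d_video, n_classes = 8, 16, 32, 2  # toy sizes, not from the paper
audio_feat = rng.normal(size=(n, d_audio))
video_feat = rng.normal(size=(n, d_video))
labels = rng.integers(0, n_classes, size=n)
w_s = rng.normal(size=(d_audio, n_classes))
w_v = rng.normal(size=(d_video, n_classes))
w_b = rng.normal(size=(d_audio + d_video, n_classes))

l_s, l_v, l_b = branch_and_fusion_losses(audio_feat, video_feat, labels, w_s, w_v, w_b)
total = l_s + l_v + l_b  # assumed unweighted sum; the real weighting is not given here
print(f"l_S={l_s:.3f}  l_V={l_v:.3f}  l_B={l_b:.3f}  total={total:.3f}")
```

Keeping the per-branch losses alongside the fused one lets each modality receive its own training signal while the concatenated representation is optimized jointly. Whether the paper weights or schedules these terms differently cannot be determined from the caption alone.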