A multi-modal approach for identifying schizophrenia using cross-modal attention

Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Carol Espy-Wilson

TL;DR

This work tackles automated identification of schizophrenia with strong positive symptoms by integrating audio, video, and text cues through cross-modal attention. It combines segment-to-session CNNs for audio/video with a Hierarchical Attention Network for text, enabling self- and cross-modal fusion via dot-product attention defined as $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_K}}\right)V$. On a dataset of 50 interviews totaling 19.43 hours, the tri-modal model achieves state-of-the-art performance, outperforming the previous multimodal system by 8.53% in weighted F1. Key contributions include the FVTC-based coordination features, the use of a HAN for text, and a comprehensive attention-driven fusion that yields robust improvements and potential clinical utility.
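
The attention formula above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sequence lengths, the embedding size of 8, and the names `audio` and `text` are assumptions chosen only for illustration. In the cross-modal case, queries come from one modality while keys and values come from another.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_K)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 8))  # hypothetical: 4 audio steps, 8-dim features
text = rng.standard_normal((6, 8))   # hypothetical: 6 text steps, 8-dim features

# Cross-modal attention: audio queries attend over text keys and values.
fused, weights = attention(audio, text, text)

# Self-attention is the special case where Q, K, and V share one modality.
self_fused, _ = attention(audio, audio, audio)
```

Each row of `weights` sums to 1, so every audio step receives a convex combination of the text representations.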

Abstract

This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectively, which were then used to compute high-level coordination features that served as the inputs to the audio and video modalities. Context-independent text embeddings extracted from transcriptions of speech were used as the input for the text modality. The multi-modal system is developed by fusing a segment-to-session-level classifier for video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score.
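
For readers unfamiliar with the reported metric, the weighted average F1 score is the per-class F1 averaged with weights proportional to each class's number of true samples. The sketch below uses toy labels that are not data from the paper, and the function name `weighted_f1` is hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    # Support = number of true samples per class; it sets each class's weight.
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Toy example: 0 = healthy control, 1 = subject with schizophrenia (illustrative labels only).
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]
score = weighted_f1(y_true, y_pred)
```

Weighting by support keeps the score from being dominated by the minority class when the two groups differ in size.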

Paper Structure (12 sections, 1 equation, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Segment-to-session-level video classification model (audio model dimensions are denoted on top, video model dimensions are denoted below the audio model dimensions)
  • Figure 2: Multimodal classification model