Learning Co-Speech Gesture for Multimodal Aphasia Type Detection

Daeun Lee; Sejung Son; Hyolim Jeon; Seungbae Kim; Jinyoung Han

TL;DR

This work proposes a multimodal graph neural network for aphasia type detection using speech and corresponding gesture patterns, and demonstrates the superiority of this approach over existing methods, achieving state-of-the-art results.

Abstract

Aphasia, a language disorder resulting from brain damage, requires accurate identification of specific aphasia types, such as Broca's and Wernicke's aphasia, for effective treatment. However, little attention has been paid to developing methods to detect different types of aphasia. Recognizing the importance of analyzing co-speech gestures for distinguishing aphasia types, we propose a multimodal graph neural network for aphasia type detection using speech and corresponding gesture patterns. By learning the correlation between the speech and gesture modalities for each aphasia type, our model can generate textual representations sensitive to gesture information, leading to accurate aphasia type detection. Extensive experiments demonstrate the superiority of our approach over existing methods, achieving state-of-the-art results (F1 84.2%). We also show that gesture features outperform acoustic features, highlighting the significance of gesture expression in detecting aphasia types. We provide the code for reproducibility purposes.

Paper Structure (19 sections, 10 equations, 4 figures, 5 tables)

Introduction
Related Work
Aphasia Dataset
Aphasia Type Detection
Problem Statement
Speech-Gesture Graph Encoder
Heterogeneous Graph Construction
Cross-relation Aggregation
Gesture-aware Word Embedding Layer
Multimodal Fusion Encoder
Aphasia Type Prediction
Experiments
Experimental Settings
Baselines
Results
...and 4 more sections

Figures (4)

Figure 1: Variations in gestures observed across different types of aphasia. Each aligned data sample (text, audio, gesture) is extracted using Automatic Speech Recognition (ASR) (see the Aphasia Dataset section).
Figure 2: The overall architecture of the proposed model: ① Speech-Gesture Graph Encoder, ② Gesture-aware Word Embedding Layer, ③ Multimodal Fusion Encoder, and ④ Aphasia Type Prediction Decoder.
Figure 3: Performance of the model by the number of disfluency tokens ($m$).
Figure 4: Visualization of the crossmodal attention matrix from the V → L network of the Multimodal Fusion Encoder for the proposed model with (b) or without (c) the speech-gesture graph encoder. (a) presents the differences in the spatial position of the right wrist's landmark compared to the previous frame, where positive and negative values indicate upward/rightward and downward/leftward movements.
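
The frame-to-frame displacement described for panel (a) of Figure 4 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `frame_deltas`, the `y_points_down` flag, and the sample coordinates are all hypothetical, and the sketch assumes landmarks are given as (x, y) pairs in image coordinates where y grows downward, so the y difference is negated to make upward movement positive.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position of one landmark in one frame


def frame_deltas(track: List[Point], y_points_down: bool = True) -> List[Point]:
    """Return per-frame (dx, dy) relative to the previous frame.

    Positive dx means rightward movement and positive dy means upward movement.
    If y_points_down is True (typical image coordinates, an assumption here),
    the y difference is negated so that upward motion comes out positive.
    """
    deltas = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx = x1 - x0
        dy = y1 - y0
        if y_points_down:
            dy = -dy
        deltas.append((dx, dy))
    return deltas


# Hypothetical right-wrist positions over four frames (not data from the paper).
wrist = [(0.50, 0.60), (0.52, 0.55), (0.52, 0.58), (0.49, 0.58)]
deltas = frame_deltas(wrist)
```

A track of n frames yields n - 1 displacement pairs; whether the y sign needs flipping depends on the coordinate convention of the pose estimator used.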
