Swin-BERT: A Feature Fusion System designed for Speech-based Alzheimer's Dementia Detection

Yilin Pan, Yanpei Shi, Yijia Zhang, Mingyu Lu

TL;DR

Swin-BERT tackles automatic dementia detection from speech by combining acoustic and linguistic cues while mitigating confounds unrelated to cognitive status, such as age and gender. Its acoustic module is built on shifted-window multi-head attention (borrowed from the Swin Transformer) and takes age and gender as extra inputs. Its linguistic module works on transcripts from which rhythm-related information was removed during transcription, and compensates by feeding character-level transcripts as an extra input to a word-level BERT-style system. Fusing the two streams yields an F-score of $85.58\%$ on ADReSS and $87.32\%$ on ADReSSo. Both single-modality systems are reported as better than or comparable to prior work on the two datasets, and the fused system performs best.

Abstract

Speech is commonly used to build automatic Alzheimer's dementia (AD) detection systems, because acoustic and linguistic abilities decline in people living with AD from the early stages. However, speech carries not only AD-related local and global information but also information unrelated to cognitive status, such as age and gender. In this paper, we propose a speech-based system named Swin-BERT for automatic dementia detection. For the acoustic part, shifted-window multi-head attention, which was originally proposed to extract local and global information from images, is used to design our acoustic-based system. To decouple the effect of age and gender on acoustic feature extraction, both are provided as extra inputs to the acoustic system. For the linguistic part, rhythm-related information, which varies significantly between people living with and without AD, is removed when the audio recordings are transcribed. To compensate for the removed rhythm-related information, we propose using character-level transcripts as an extra input to a word-level BERT-style system. Finally, Swin-BERT combines the acoustic features learned by the proposed acoustic-based system with our linguistic-based system. The experiments use the two datasets provided by the international dementia detection challenges, ADReSS and ADReSSo. The results show that both the proposed acoustic and linguistic systems perform better than or comparably to previous research on the two datasets. The proposed Swin-BERT system achieves superior results, with F-scores of 85.58\% on ADReSS and 87.32\% on ADReSSo.
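
The abstract does not say how the acoustic module tiles its input, so what follows is only a minimal sketch of the generic shifted-window idea from the Swin Transformer, not the authors' implementation. It assumes a hypothetical 2-D time-frequency feature map whose sides are multiples of the window size. The function names `window_partition` and `shifted_window_partition` are illustrative. Attention restricted to each tile captures local structure, and alternating regular and shifted tilings lets information cross tile borders, which is how local and global context can both be reached.

```python
import numpy as np


def window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Split a (T, F) map into non-overlapping (window, window) tiles.

    Returns an array of shape (num_tiles, window, window), tiles in
    row-major order. Assumes T and F are multiples of `window`.
    """
    t, f = x.shape
    if t % window or f % window:
        raise ValueError("both sides must be multiples of the window size")
    tiles = x.reshape(t // window, window, f // window, window)
    # Bring the two tile indices to the front, then flatten them.
    return tiles.transpose(0, 2, 1, 3).reshape(-1, window, window)


def shifted_window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Cyclically shift the map by half a window, then tile it.

    The shift makes each new tile straddle the borders of four tiles
    from the regular partition.
    """
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)


if __name__ == "__main__":
    # Toy 8x8 "spectrogram" (an assumption purely for illustration).
    feature_map = np.arange(64).reshape(8, 8)
    regular = window_partition(feature_map, window=4)
    shifted = shifted_window_partition(feature_map, window=4)
    print(regular.shape, shifted.shape)
```

In a full model, self-attention would be computed independently inside each tile, with a mask on tiles that wrap around the edge after the cyclic shift. That masking step is omitted here.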

Paper Structure

This paper contains 25 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Magnetization as a function of applied field. Note that "Fig." is abbreviated. There is a period after the figure number, followed by two spaces. It is good practice to explain the significance of the figure in the caption.