Leveraging Multimodal Methods and Spontaneous Speech for Alzheimer's Disease Identification
Yifan Gao, Long Guo, Hong Liu
TL;DR
This work addresses early detection of Alzheimer's disease from spontaneous speech within the PROCESS Grand Challenge. It introduces a multimodal pipeline that fuses temporally aware acoustic embeddings (Whisper embeddings and time-aware variants) with interpretable linguistic features, and combines model predictions by voting for two tasks: classification (healthy, MCI, or dementia) and regression (MMSE score prediction). The approach achieves the top overall ranking in the challenge, with a classification F1 score of 0.649 and an MMSE RMSE of 2.628, illustrating the value of integrating temporal acoustic cues with linguistic indicators. The results point to the potential of scalable, speech-based tools that use multimodal signals to identify Alzheimer's disease in clinical and screening settings.
Abstract
Detecting cognitive impairment from spontaneous speech is a promising avenue for early diagnosis of Alzheimer's disease (AD) and mild cognitive impairment (MCI), where timely intervention can significantly improve patient outcomes. The PROCESS Grand Challenge at ICASSP 2025 targets this problem by promoting innovative classification and regression methods for detecting cognitive decline. In this paper, we propose a multimodal fusion strategy that combines interpretable linguistic features with temporal embeddings extracted from pre-trained models. Our approach achieves an F1-score of 0.649 on the classification task (predicting healthy, MCI, or dementia) and an RMSE of 2.628 on the regression task (MMSE score prediction), securing the top overall ranking in the competition.
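As a rough illustration of how predictions from several models might be combined and scored on the two tasks, the following minimal sketch uses majority voting for the class labels and simple averaging for MMSE scores. Several details are assumptions made for illustration only and are not taken from the paper: the label names, the tie-breaking rule, the use of averaging for the regression task, the macro-averaged form of F1, and all function names.

```python
from collections import Counter
from math import sqrt

# Hypothetical label names for the three classes (healthy, MCI, dementia).
LABELS = ("HC", "MCI", "Dementia")


def majority_vote(predictions):
    """Fuse per-model label lists (all the same length) by majority vote.

    Tie-breaking is an assumption: among tied labels, the one predicted
    by the earliest model in the list wins.
    """
    fused = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        fused.append(next(v for v in sample_votes if counts[v] == top))
    return fused


def average_scores(predictions):
    """Fuse per-model MMSE predictions by a plain mean (an assumption)."""
    return [sum(scores) / len(scores) for scores in zip(*predictions)]


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores (averaging mode assumed)."""
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted MMSE scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Toy predictions from three hypothetical models on four recordings.
    acoustic = ["HC", "MCI", "Dementia", "MCI"]
    linguistic = ["HC", "HC", "Dementia", "MCI"]
    combined = ["MCI", "MCI", "Dementia", "HC"]
    fused = majority_vote([acoustic, linguistic, combined])
    print(fused, macro_f1(["HC", "MCI", "Dementia", "MCI"], fused))
```

The toy inputs are invented for the example and carry no relation to the challenge data; the reported scores of 0.649 and 2.628 come from the authors' system, not from this sketch.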
