A Dual-Stage Time-Context Network for Speech-Based Alzheimer's Disease Detection
Yifan Gao, Long Guo, Hong Liu
TL;DR
The paper tackles the challenge of detecting Alzheimer's disease from long-duration speech by modeling both local acoustic cues and global conversational context. It introduces the Dual-Stage Time-Context Network (DSTC-Net), which segments long recordings and uses an Intra-Segment Temporal Attention (ISTA) module to refine local features via a BiLSTM with frame-level attention, alongside a Cross-Segment Context Attention (CSCA) module to fuse segment representations into a global feature for classification. On the ADReSSo dataset, the Whisper-based DSTC-Net achieves 83.10% accuracy and 83.15% F1, outperforming prior approaches and validating the importance of integrating intra- and inter-segment information. The findings demonstrate a scalable, non-invasive screening approach that leverages both micro-level acoustic cues and macro-level discourse patterns in long speech for reliable AD detection.
Abstract
Alzheimer's disease (AD) is a progressive neurodegenerative disorder that leads to irreversible cognitive decline in memory and communication. Early detection of AD through speech analysis is crucial for delaying disease progression. However, existing methods mainly use pre-trained acoustic models for feature extraction but have limited ability to model both local and global patterns in long-duration speech. In this letter, we introduce a Dual-Stage Time-Context Network (DSTC-Net) for speech-based AD detection, integrating local acoustic features with global conversational context in long-duration recordings.We first partition each long-duration recording into fixed-length segments to reduce computational overhead and preserve local temporal details.Next, we feed these segments into an Intra-Segment Temporal Attention (ISTA) module, where a bidirectional Long Short-Term Memory (BiLSTM) network with frame-level attention extracts enhanced local features.Subsequently, a Cross-Segment Context Attention (CSCA) module applies convolution-based context modeling and adaptive attention to unify global patterns across all segments.Extensive experiments on the ADReSSo dataset show that our DSTC-Net outperforms state-of-the-art models, reaching 83.10% accuracy and 83.15% F1.
