A Dual-Stage Time-Context Network for Speech-Based Alzheimer's Disease Detection

Yifan Gao, Long Guo, Hong Liu

TL;DR

The paper tackles the challenge of detecting Alzheimer's disease from long-duration speech by modeling both local acoustic cues and global conversational context. It introduces the Dual-Stage Time-Context Network (DSTC-Net), which segments long recordings and uses an Intra-Segment Temporal Attention (ISTA) module to refine local features via a BiLSTM with frame-level attention, alongside a Cross-Segment Context Attention (CSCA) module to fuse segment representations into a global feature for classification. On the ADReSSo dataset, the Whisper-based DSTC-Net achieves 83.10% accuracy and 83.15% F1, outperforming prior approaches and validating the importance of integrating intra- and inter-segment information. The findings demonstrate a scalable, non-invasive screening approach that leverages both micro-level acoustic cues and macro-level discourse patterns in long speech for reliable AD detection.

Abstract

Alzheimer's disease (AD) is a progressive neurodegenerative disorder that leads to irreversible cognitive decline in memory and communication. Early detection of AD through speech analysis is crucial for delaying disease progression. However, existing methods mainly use pre-trained acoustic models for feature extraction but have limited ability to model both local and global patterns in long-duration speech. In this letter, we introduce a Dual-Stage Time-Context Network (DSTC-Net) for speech-based AD detection, integrating local acoustic features with global conversational context in long-duration recordings. We first partition each long-duration recording into fixed-length segments to reduce computational overhead and preserve local temporal details. Next, we feed these segments into an Intra-Segment Temporal Attention (ISTA) module, where a bidirectional Long Short-Term Memory (BiLSTM) network with frame-level attention extracts enhanced local features. Subsequently, a Cross-Segment Context Attention (CSCA) module applies convolution-based context modeling and adaptive attention to unify global patterns across all segments. Extensive experiments on the ADReSSo dataset show that our DSTC-Net outperforms state-of-the-art models, reaching 83.10% accuracy and 83.15% F1.
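
The two-stage idea in the abstract (segment the recording, pool frames within each segment, then pool across segments) can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the pre-trained encoder, the BiLSTM, and the convolutional context modeling, and replaces both attention modules with plain dot-product attention pooling. The function names and the `frame_query` and `segment_query` vectors are hypothetical stand-ins for learned parameters.

```python
import math


def split_into_segments(frames, segment_len):
    """Split a list of frame vectors into fixed-length segments.

    The last segment may be shorter when the recording length is not a
    multiple of segment_len (an assumption; the source does not say how
    the remainder is handled).
    """
    return [frames[i:i + segment_len] for i in range(0, len(frames), segment_len)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighted average of vectors; weights are softmax(dot(vector, query))."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def two_stage_pool(frames, segment_len, frame_query, segment_query):
    """Stage 1: pool frames inside each segment (stand-in for ISTA).
    Stage 2: pool the segment vectors into one global vector (stand-in for CSCA).
    """
    segments = split_into_segments(frames, segment_len)
    segment_vectors = [attention_pool(seg, frame_query) for seg in segments]
    return attention_pool(segment_vectors, segment_query)


# Toy input: five 2-dimensional "frames", segments of two frames each.
toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
global_feature = two_stage_pool(toy_frames, 2, [0.0, 0.0], [0.0, 0.0])
print(global_feature)
```

With all-zero query vectors the attention weights are uniform, so each stage reduces to a plain mean. Non-zero queries shift weight toward the frames or segments that align with them. In the paper's setting the global vector would then feed a classifier.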

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the proposed DSTC-Net framework.
  • Figure 2: ISTA Module.
  • Figure 3: CSCA Module.
  • Figure 4: The relationship between the contextual layer depth and the performance of the extracted features for different segmentation lengths; results are obtained from Whisper (left), Wav2Vec2.0 (middle), and HuBERT (right) on the ADReSSo dataset.