Temporal-Aware Iterative Speech Model for Dementia Detection
Chukwuemeka Ugwu, Oluwafemi Oyeleke
TL;DR
This paper addresses early dementia detection by modeling temporal dynamics in speech rather than static linguistic features. It introduces TAI-Speech, a temporal-aware iterative framework that treats speech as a sequence of spectrogram frames, applying optical-flow-inspired iterative refinement with ConvGRU updates and cross-attention between acoustic and prosodic cues, followed by Transformer-based utterance aggregation. On the DementiaBank Pitt dataset, the approach achieves an AUC of 0.839 and an accuracy of 80.6% without automatic speech recognition (ASR), outperforming several text-based baselines and demonstrating the value of modeling speech production dynamics for cognitive assessment. The link to Instrumental Activities of Daily Living (IADL) is theoretically motivated; empirical validation with longitudinal IADL data, broader demographic validation, and potential multimodal extensions remain for future work.
Abstract
Deep learning systems often struggle with processing long sequences, where computational complexity can become a bottleneck. Current methods for automated dementia detection from speech frequently rely on static, time-agnostic features or aggregated linguistic content, and lack the flexibility to model the subtle, progressive deterioration inherent in speech production. As a result, they often miss the dynamic temporal patterns that are critical early indicators of cognitive decline. In this paper, we introduce TAI-Speech, a Temporal-Aware Iterative framework that dynamically models spontaneous speech for dementia detection. Our method rests on two key innovations: 1) Optical-Flow-Inspired Iterative Refinement: by treating spectrograms as sequential frames, this component uses a convolutional GRU to capture the fine-grained, frame-to-frame evolution of acoustic features. 2) Cross-Attention-Based Prosodic Alignment: this component dynamically aligns spectral features with prosodic patterns, such as pitch and pauses, to create a richer representation of speech production deficits linked to functional decline in Instrumental Activities of Daily Living (IADL). TAI-Speech adaptively models the temporal evolution of each utterance, enhancing the detection of cognitive markers. Experimental results on the DementiaBank Pitt dataset show that TAI-Speech achieves an AUC of 0.839 and an accuracy of 80.6%, outperforming text-based baselines without relying on automatic speech recognition (ASR). Our work provides a more flexible and robust solution for automated cognitive assessment, operating directly on the dynamics of raw audio.
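To make the prosodic alignment idea concrete, the following is a minimal sketch of scaled dot-product cross-attention in plain Python, in which each spectral frame queries a short sequence of prosodic steps. It is an illustration under stated assumptions, not the TAI-Speech implementation: the function names, the toy two-dimensional features, and the omission of learned query/key/value projections are all choices made here for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: one vector per spectral frame (hypothetical acoustic features)
    keys:    one vector per prosodic step, same width as the queries
    values:  one vector per prosodic step (hypothetical pitch/pause features)
    Returns the attended prosodic vector for each frame and the weights used.
    """
    dim = len(keys[0])
    attended, all_weights = [], []
    for q in queries:
        # Similarity of this frame to every prosodic step, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the prosodic value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        attended.append(out)
        all_weights.append(weights)
    return attended, all_weights


# Toy inputs (assumed for illustration only).
spectral = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prosodic_keys = [[1.0, 0.0], [0.0, 1.0]]
prosodic_values = [[220.0, 0.0], [0.0, 0.5]]  # (pitch in Hz, pause in seconds)

attended, weights = cross_attention(spectral, prosodic_keys, prosodic_values)
```

In this toy setup the first frame attends mostly to the first prosodic step, while the third frame, which matches both keys equally, receives an even blend of the two value vectors. A trained model would typically apply learned linear projections to the queries, keys, and values and use multiple attention heads.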
