Evaluating Spoken Language as a Biomarker for Automated Screening of Cognitive Impairment

Maria R. Lima, Alexander Capstick, Fatemeh Geranmayeh, Ramin Nilforooshan, Maja Matarić, Ravi Vaidyanathan, Payam Barnaghi

TL;DR

The paper addresses the need for scalable, non-invasive screening of cognitive impairment using spoken language biomarkers. It develops an interpretable ML pipeline, prioritizing lexical linguistic features and SHAP explanations to predict ADRD risk and MMSE severity from DementiaBank data, with external validation and a real-world pilot. Results show strong ADRD discrimination on a held-out DementiaBank test set (ROC-AUC ~0.86) and reasonable MMSE prediction accuracy (MAE ~3.7), with risk stratification (Green/Amber/Red) to aid clinical triage. The work demonstrates potential for in-home cognitive health monitoring via conversational AI, while acknowledging limitations in generalizability, ASR noise, and pilot-scale validation, and outlining paths for extension to longitudinal, multilingual, and multi-modal settings.

Abstract

Timely and accurate assessment of cognitive impairment is a major unmet need in populations at risk. Alterations in speech and language can be early predictors of Alzheimer's disease and related dementias (ADRD) before clinical signs of neurodegeneration. Voice biomarkers offer a scalable and non-invasive solution for automated screening. However, the clinical applicability of machine learning (ML) remains limited by challenges in generalisability, interpretability, and access to patient data to train clinically applicable predictive models. Using DementiaBank recordings (N=291, 64% female), we evaluated ML techniques for ADRD screening and severity prediction from spoken language. We validated model generalisability with pilot data collected in-residence from older adults (N=22, 59% female). Risk stratification and linguistic feature importance analysis enhanced the interpretability and clinical utility of predictions. For ADRD classification, a Random Forest applied to lexical features achieved a mean sensitivity of 69.4% (95% confidence interval (CI) = 66.4-72.5) and specificity of 83.3% (78.0-88.7). On real-world pilot data, this model achieved a mean sensitivity of 70.0% (58.0-82.0) and specificity of 52.5% (39.3-65.7). For severity prediction using Mini-Mental State Examination (MMSE) scores, a Random Forest Regressor achieved a mean absolute MMSE error of 3.7 (3.7-3.8), with comparable performance of 3.3 (3.1-3.5) on pilot data. Linguistic features associated with higher ADRD risk included increased use of pronouns and adverbs, greater disfluency, reduced analytical thinking, lower lexical diversity and fewer words reflecting a psychological state of completion. Our interpretable predictive modelling offers a novel approach for in-home integration with conversational AI to monitor cognitive health and triage higher-risk individuals, enabling earlier detection and intervention.
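The metrics reported in the abstract (sensitivity, specificity, mean absolute MMSE error, each with a 95% CI over repeated evaluations) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, and the normal-approximation interval (mean ± 1.96 × standard error across repeats) is an assumption, since the abstract does not state how the intervals were computed.

```python
import math
from statistics import mean, stdev


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = ADRD."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between true and predicted MMSE scores."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation 95% CI over repeats (assumed method)."""
    m = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
    print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
    print(mean_with_ci([0.68, 0.71, 0.70, 0.69]))
```

In practice, each repeat (for example, each bootstrap repeat) would contribute one sensitivity, specificity, or MAE value, and `mean_with_ci` would summarise the resulting list.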

Paper Structure

This paper contains 28 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Predicted positive cases per cognitive group. The RF-NLP model predicted probabilities for ADRD detection. The total number of samples per cognitive group (CN, mild, moderate, severe, based on MMSE) is shown considering the values from 10 bootstrap repeats. Lower MMSE values reflect worse cognition.
  • Figure 2: Risk level distribution by MMSE scores. Distribution of the Green, Amber and Red risk groups across each MMSE score on the test set for the RF model using explainable linguistic features. The prediction results are reported considering 10 bootstrap repeats.
  • Figure 3: SHAP results. (a) The feature importance for the top 12 most important features on the test set and their corresponding feature values from the RF-NLP model. Lower SHAP values suggest reduced risk of ADRD. The colour represents the normalised feature value, and the position on the x-axis represents the contribution that value made to the prediction. (b) SHAP values of a single prediction show how each feature contributed to a correct prediction of a negative ADRD case. Here, the values on the arrows correspond to the normalised feature value in units of standard deviations away from the mean.
  • Figure 4: Model performance in severity prediction across cognitive groups. MAE for predictions on the DementiaBank test set. The error bars represent the standard deviation of the values from the 10 bootstrap repeats for each participant.
  • Figure 5: Proposed ML pipeline for cognitive health assessment. Analysis used for screening of cognitive health and MMSE prediction from spoken language.
  • ...and 5 more figures
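
The risk stratification shown in Figure 2 (Green, Amber and Red groups across MMSE scores) can be sketched as a simple mapping from a predicted ADRD probability to a risk level. This is an illustrative sketch only: the thresholds 0.4 and 0.6 and the function names are hypothetical assumptions, as the cut-offs used in the paper are not given in this summary.

```python
from collections import Counter, defaultdict


def risk_level(prob, low=0.4, high=0.6):
    """Map a predicted ADRD probability to Green/Amber/Red.

    `low` and `high` are hypothetical cut-offs, not values from the paper.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be in [0, 1]")
    if prob < low:
        return "Green"
    if prob < high:
        return "Amber"
    return "Red"


def risk_distribution_by_mmse(mmse_scores, probs, low=0.4, high=0.6):
    """Count risk levels per MMSE score, similar in spirit to Figure 2."""
    dist = defaultdict(Counter)
    for score, prob in zip(mmse_scores, probs):
        dist[score][risk_level(prob, low, high)] += 1
    return {score: dict(counts) for score, counts in dist.items()}


if __name__ == "__main__":
    # Toy inputs for illustration only; not data from the paper.
    print(risk_distribution_by_mmse([29, 29, 18], [0.1, 0.5, 0.9]))
```

Keeping an intermediate Amber band lets borderline predictions be flagged for follow-up rather than forced into a binary decision; where the band sits is a design choice that would need clinical input.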