CognoSpeak: an automatic, remote assessment of early cognitive decline in real-world conversational speech

Madhurananda Pahar, Fuxiang Tao, Bahman Mirheidari, Nathan Pevy, Rebecca Bright, Swapnil Gadgil, Lise Sproson, Dorota Braun, Caitlin Illingworth, Daniel Blackburn, Heidi Christensen

TL;DR

CognoSpeak presents a remote, multimodal framework for early detection of cognitive decline using real-world conversational speech elicited by a virtual agent. By combining acoustic features with linguistic representations from foundation models, the study achieves its strongest result with DistilBERT on four tasks, attaining an $F_1$-score of 0.873 and high accuracy for healthy controls. The work demonstrates the feasibility of scalable, low-cost cognitive screening and provides a large, richly annotated dataset to advance robust, generalizable detection across dementia, MCI, and healthy aging. The authors plan to scale to balanced, longitudinal analyses and to release the data publicly to accelerate research and clinical adoption.

Abstract

The early signs of cognitive decline are often noticeable in conversational speech, and identifying those signs is crucial in dealing with later and more serious stages of neurodegenerative diseases. Clinical detection is costly and time-consuming, and although there has been recent progress in the automatic detection of speech-based cues, those systems are trained on relatively small databases that lack detailed metadata and demographic information. This paper presents CognoSpeak and its associated data collection efforts. CognoSpeak asks long- and short-term memory-probing questions and administers standard cognitive tasks, such as verbal and semantic fluency and picture description, using a virtual agent on a mobile or web platform. In addition, it collects multimodal data such as audio and video, along with a rich set of metadata, from primary and secondary care, memory clinics and remote settings like people's homes. Here, we present results from 126 subjects whose audio was manually transcribed. Several classic classifiers, as well as large language model-based classifiers, have been investigated and evaluated across the different types of prompts. We demonstrate a high level of performance; in particular, we achieved an F1-score of 0.873 using a DistilBERT model to discriminate people with cognitive impairment (dementia or mild cognitive impairment (MCI)) from healthy volunteers using the memory responses, fluency tasks and Cookie Theft picture description. CognoSpeak is an automatic, remote, low-cost, repeatable, non-invasive and less stressful alternative to existing clinical cognitive assessments.

Paper Structure (13 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: CognoSpeak will expedite the stratification process for people who are concerned about their cognitive health, as it can accurately distinguish those who show signs of cognitive decline, and therefore need referral for a more specialist assessment, from those whose memory problems have other causes such as depression or anxiety. Data collection is ongoing in various parts of the UK, encompassing a wide range of accents and demographics. Participants are recruited nationwide through primary and secondary care (including a number of memory clinics), websites such as https://www.joindementiaresearch.nihr.ac.uk/, and social media channels.
  • Figure 2: CognoSpeak collects real-world audio and video recordings while a virtual agent prompts the subject with a diverse range of tasks that have proven clinically effective, on multiple platforms such as mobile and web. Four avatars (2 male and 2 female) from diverse ethnic groups were used as the virtual agent. The recorded audio is then pre-processed, and both acoustic and linguistic features are extracted to train and evaluate standard classifiers and sequence classifiers (foundation models). Finally, those with either dementia or mild cognitive impairment (MCI) are distinguished from healthy volunteers.
  • Figure 3: Distribution of the demographic information (age, ethnicity and gender) of all 126 subjects used in this study. The age distribution shows that the healthy group tends to be younger. The ethnicity distribution shows that, although a small number of subjects represent other ethnicities, most of the participants are white British. "Mixed" ethnicity denotes mixed white and black. The gender distribution is almost equally balanced across the groups.
  • Figure 4: Classification based on individual tasks. For each classifier, the top half shows the $F_1$-scores obtained with the acoustic features and the bottom half shows those obtained with the linguistic features. The performance is better when using the linguistic features.
  • Figure 5: Confusion matrix. The best-performing DistilBERT model, reported in the results table, was applied to all test folds of cross-validation. Detecting MCI was the most challenging.
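
To make the link between a confusion matrix and the reported $F_1$-scores concrete, the following is a minimal sketch of how per-class $F_1$ can be derived from pooled confusion-matrix counts. It is not the authors' evaluation code: the function names, the class ordering, the row/column convention (rows as true classes, columns as predicted classes) and the counts are all assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
from typing import Dict, List


def per_class_f1(confusion: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Compute F1 for each class from a square confusion matrix.

    Assumed convention: confusion[i][j] counts samples whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    scores: Dict[str, float] = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]                          # correctly predicted as class i
        fn = sum(confusion[i]) - tp                   # class i predicted as something else
        fp = sum(row[i] for row in confusion) - tp    # other classes predicted as class i
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


# Hypothetical counts for illustration only (NOT the paper's results).
labels = ["healthy", "MCI", "dementia"]
confusion = [
    [8, 2, 0],  # true healthy
    [3, 5, 2],  # true MCI
    [0, 2, 8],  # true dementia
]

scores = per_class_f1(confusion, labels)
for name, value in scores.items():
    print(f"{name}: F1 = {value:.3f}")
print(f"macro F1 = {macro_f1(scores):.3f}")
```

In this invented example the middle class has the lowest $F_1$ because its errors fall on both neighbouring classes, which is the kind of pattern a caption such as "Detecting MCI was the most challenging" would describe.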