Can we trust AI to detect healthy multilingual English speakers among the cognitively impaired cohort in the UK? An investigation using real-world conversational speech

Madhurananda Pahar; Caitlin Illingworth; Dorota Braun; Bahman Mirheidari; Lise Sproson; Daniel Blackburn; Heidi Christensen

TL;DR

This study interrogates the trustworthiness of AI-based cognitive decline detection in the UK’s multilingual, ethnic minority populations. By evaluating ASR performance, three-/two-class classification, and MMSE prediction on CognoMemory data (across monolingual and multilingual groups) and comparing with DementiaBank, it finds minimal ASR bias but clear biases in linguistic-feature-based models, particularly disadvantaging multilingual speakers and certain accents. The results underscore the need for bias-mitigated, generalisable models and culturally informed screening tools before deploying such AI in clinical settings. The work contributes a large, ethnically diverse real-world dataset and demonstrates the complexities of translating high-performing models from majority populations to diverse UK communities, with implications for fairer, more accessible dementia screening.

Abstract

Conversational speech often reveals early signs of cognitive decline, such as dementia and mild cognitive impairment (MCI). In the UK, one in four people belongs to an ethnic minority, and dementia prevalence is expected to rise most rapidly among Black and Asian communities. This study examines the trustworthiness of AI models, specifically the presence of bias, in detecting healthy multilingual English speakers among the cognitively impaired cohort, with the aim of making these tools clinically beneficial. For the experiments, monolingual participants were recruited nationally (UK), and multilingual speakers were enrolled from four community centres in Sheffield and Bradford. The multilingual participants spoke English with a non-native accent alongside Somali, Chinese, or South Asian languages; the South Asian language speakers were further divided by Yorkshire accent (West and South) to test the AI tools thoroughly. Although ASR systems showed no significant bias across groups, classification and regression models using acoustic and linguistic features exhibited bias against multilingual speakers, particularly in memory, fluency, and reading tasks. This bias was more pronounced when models were trained on the publicly available DementiaBank dataset. Moreover, multilinguals were more likely to be misclassified as having cognitive decline. This study is the first of its kind to show that, despite their strong overall performance, current AI models are biased against multilingual individuals from ethnic minority backgrounds in the UK, and that they are also more likely to misclassify speakers with a certain accent (South Yorkshire) as living with more severe cognitive decline. From this pilot study, we conclude that existing AI tools are not yet reliable for diagnostic use in these populations, and we aim to address this in future work by developing more generalisable, bias-mitigated models.

Paper Structure (18 sections, 7 figures, 4 tables)

Introduction
Previous work
Data
Data collection
Data description
Experimental setup
Feature Extraction and Classifier Training
Evaluation
Results
Word Error Rate (WER)
Classification
3-way classification using CognoMemory data
2-way classification using DementiaBank data
Regression
Qualitative analysis: TF-IDF
...and 3 more sections

Figures (7)

Figure 1: Data collection and testing summary. Monolingual speakers came from across the UK, whereas multilingual speakers were recruited from four community centres in Sheffield and Bradford: the Sheffield Chinese Community Centre (Mandarin and Cantonese Chinese-English multilinguals), ISRAAC (Somali-English multilinguals), ShipShape (Sheffield South Asian-English multilinguals), and Meri Yaadain (Bradford South Asian-English multilinguals). South Asian languages included Hindi, Urdu, Punjabi, Mirpuri and Arabic. All participants underwent CognoMemory assessments, answering 14 memory-probing, clinically effective question prompts asked by a virtual agent. Acoustic and linguistic features are extracted from the responses to these prompts and used to train classifiers and regressors for comparison between monolingual and multilingual speakers, as described in Table \ref{table:dataset}.
Figure 2: Distribution of WERs across question prompts for three ASR systems. The monolingual group is labelled 'English'. WER tends to be higher for the fluency tests ($\mathbf{Q}_{10}$, $\mathbf{Q}_{11}$). Overall, Whisper and Wav2Vec 2.0 perform similarly ($p$-value = 0.66), whereas NeMo's performance differs significantly ($p$-value $\approx$ 0).
Figure 3: Classifier accuracies for monolingual (Older and Younger English) and multilingual (Somali, Chinese, and Asian with West Yorkshire and South Yorkshire accents) English speakers. The SVM trained on acoustic features performed worst but was more consistent across monolingual and multilingual speakers, whereas LLMs performed best but showed a stronger bias, which is further confirmed by the $p$-values in Table \ref{table:accuracies_p_val}.
Figure 4: Predicted labels from the 3-way classifiers. Only 16% of younger monolingual speakers (11% MCI and 5% dementia) were misclassified as having cognitive decline, whereas this rises to 35% (10% MCI and 25% dementia) for Asian multilingual speakers with a South Yorkshire accent.
Figure 5: Classifier accuracies when trained on DementiaBank show a similar but stronger bias than that found in Figure \ref{fig:accuracies_ALL}. Classifiers performed better for monolingual than for multilingual speakers.
...and 2 more figures
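
For readers unfamiliar with the metric plotted in Figure 2, the following is a minimal sketch of how a word error rate (WER) is conventionally computed: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate`, the lowercase-and-split normalisation, and the example sentences are illustrative assumptions; this summary does not state which text normalisation or scoring tool the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalised only by lowercasing and whitespace
    splitting; real evaluations often also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the CognoMemory data.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on mat"
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is worth keeping in mind when comparing short fluency-test responses.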
