Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification?

Changye Li; Weizhe Xu; Trevor Cohen; Serguei Pakhomov

TL;DR

The paper investigates whether automatic speech recognition errors can enhance downstream dementia classification using the Cookie Theft task. By comparing pre-trained and domain-adapted ASR models with beam-search decoding against manual transcripts, and by applying a BERT classifier to ASR-derived transcripts, the study reveals a counterintuitive finding: imperfect transcripts often yield higher classification accuracy and AUC than verbatim transcripts. SHAP-based error analysis and content-unit evaluations show that systematic ASR errors capture linguistically and acoustically informative cues related to dementia, while interpretability improves through transcript-level explanations. The results highlight a practical synergy between ASR and classification models, suggesting that carefully designed ASR pipelines could support scalable cognitive impairment screening while outlining limitations related to data quality and generalizability.

Abstract

\textbf{Objectives}: We aimed to investigate how errors from automatic speech recognition (ASR) systems affect dementia classification accuracy, specifically in the ``Cookie Theft'' picture description task. We aimed to assess whether imperfect ASR-generated transcripts could provide valuable information for distinguishing between language samples from cognitively healthy individuals and those with Alzheimer's disease (AD). \textbf{Methods}: We conducted experiments using various ASR models, refining their transcripts with post-editing techniques. Both these imperfect ASR transcripts and manually transcribed ones were used as inputs for the downstream dementia classification. We conducted comprehensive error analysis to compare model performance and assess ASR-generated transcript effectiveness in dementia classification. \textbf{Results}: Imperfect ASR-generated transcripts surprisingly outperformed manual transcription for distinguishing between individuals with AD and those without in the ``Cookie Theft'' task. These ASR-based models surpassed the previous state-of-the-art approach, indicating that ASR errors may contain valuable cues related to dementia. The synergy between ASR and classification models improved overall accuracy in dementia classification. \textbf{Conclusion}: Imperfect ASR transcripts effectively capture linguistic anomalies linked to dementia, improving accuracy in classification tasks. This synergy between ASR and classification models underscores ASR's potential as a valuable tool in assessing cognitive impairment and related clinical applications.

Paper Structure (22 sections, 1 equation, 5 figures, 4 tables)

Introduction
Methods
Data
Models
Wav2Vec2
HuBERT
BERT
Model Variants
CTC and ASR Decoding Methods
Evaluation
Error Analysis
Results
Transcript Generation Performance
Classification Performance
Error Analysis
...and 7 more sections

Figures (5)

Figure 1: The "Cookie Theft" picture description stimulus. In this task, participants are shown the picture and asked to describe everything they observe in it.
Figure 2: Overview of model development and evaluation for the downstream classification.
Figure 3: Classification performance, including accuracy (ACC) and AUC, with the corresponding bootstrapped 95% t-distribution confidence intervals. Metrics were calculated on participant-level ASR-generated transcripts. The horizontal lines represent the performance of the BERT model using manually derived transcripts, with ACC of 0.826 and AUC of 0.873. wav2vec2-base, wav2vec2-large, wav2vec2-lv60, wav2vec2-lv60-self, and hubert-large represent wav2vec2-base-960h, wav2vec2-large-960h, wav2vec2-large-960h-lv60, wav2vec2-large-lv60-self, and hubert-large-ls960-ft, respectively.
Figure 4: Visual representation of Shapley values for ADReSS ID 114-1 (healthy control), 148-0 (dementia), and 150-2 (healthy control), using pre-trained and domain-adapted hubert-large-ls960-ft with best-path decoding. $f_{LABEL\_1}(inputs)$ represents the expected probability of this transcript being produced by a dementia patient. Tokens in red have a positive impact on the prediction, whereas tokens in blue have a negative impact. A darker color indicates a stronger impact.
Figure 5: Visual representation of Shapley values for ADReSS ID 114-1 (healthy control), 148-0 (dementia), and 150-2 (healthy control), using pre-trained and domain-adapted hubert-large-ls960-ft with beam search decoding. $f_{LABEL\_1}(inputs)$ represents the expected probability of this transcript being produced by a dementia patient. Tokens in red have a positive impact on the prediction, whereas tokens in blue have a negative impact. A darker color indicates a stronger impact.
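
The Figure 3 caption reports bootstrapped 95% confidence intervals on participant-level accuracy. The sketch below illustrates one way such an interval could be computed; it is not the authors' procedure. Several things are assumptions here: the function name `bootstrap_accuracy_ci`, the number of resamples, the use of the bootstrap standard deviation as the standard error, and the substitution of a normal quantile for the t quantile (the two are nearly identical for a large number of resamples). The labels and predictions are made-up toy values.

```python
import random
import statistics


def bootstrap_accuracy_ci(labels, preds, n_boot=1000, level=0.95, seed=0):
    """Return (accuracy, lower, upper) from a participant-level bootstrap.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and of equal length")

    rng = random.Random(seed)
    n = len(labels)
    # 1 where the prediction matches the label, 0 otherwise.
    correct = [int(y == p) for y, p in zip(labels, preds)]
    point = sum(correct) / n

    # Resample participants with replacement and recompute accuracy each time.
    replicates = []
    for _ in range(n_boot):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        replicates.append(sum(sample) / n)

    # Assumption: the spread of the replicates serves as the standard error.
    se = statistics.stdev(replicates)
    # Assumption: normal quantile used as a stand-in for the t quantile.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return point, max(0.0, point - z * se), min(1.0, point + z * se)


if __name__ == "__main__":
    # Toy data: 1 = dementia, 0 = healthy control (made up for this example).
    toy_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    toy_preds = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
    acc, lower, upper = bootstrap_accuracy_ci(toy_labels, toy_preds)
    print(f"ACC = {acc:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Resampling at the participant level, rather than the utterance level, keeps each person's transcript intact within a resample, which matches the caption's statement that metrics were calculated on participant-level transcripts.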
