Linguistic Indicators of Early Cognitive Decline in the DementiaBank Pitt Corpus: A Statistical and Machine Learning Study

Artsvik Avetisyan, Sachin Kumar

TL;DR

This study investigates linguistic markers of early cognitive decline in the DementiaBank Pitt Corpus using three representations: raw cleaned text, POS-enhanced, and POS-only. It trains interpretable logistic regression and random forest models, evaluates them under transcript-level train-test splits and subject-level five-fold cross-validation, and validates the findings with Mann–Whitney U tests and Cliff's delta effect sizes. Lexical diversity, functional word usage, sentence structure, and discourse coherence distinguish dementia from control speech, and POS-based representations often give stronger, more generalizable signals. The work argues that statistically validated, linguistically grounded features can underpin transparent language-based cognitive screening suitable for clinical settings. It also acknowledges task, demographic, and modality limitations and outlines multimodal and longitudinal extensions as future work.

Abstract

Background: Subtle changes in spontaneous language production are among the earliest indicators of cognitive decline. Identifying linguistically interpretable markers of dementia can support transparent and clinically grounded screening approaches.

Methods: This study analyzes spontaneous speech transcripts from the DementiaBank Pitt Corpus using three linguistic representations: raw cleaned text, a part-of-speech (POS)-enhanced representation combining lexical and grammatical information, and a POS-only syntactic representation. Logistic regression and random forest models were evaluated under two protocols: transcript-level train-test splits and subject-level five-fold cross-validation to prevent speaker overlap. Model interpretability was examined using global feature importance, and statistical validation was conducted using Mann–Whitney U tests with Cliff's delta effect sizes.

Results: Across representations, models achieved stable performance, with syntactic and grammatical features retaining strong discriminative power even in the absence of lexical content. Subject-level evaluation yielded more conservative but consistent results, particularly for POS-enhanced and POS-only representations. Statistical analysis revealed significant group differences in functional word usage, lexical diversity, sentence structure, and discourse coherence, aligning closely with machine learning feature importance findings.

Conclusion: The results demonstrate that abstract linguistic features capture robust markers of early cognitive decline under clinically realistic evaluation. By combining interpretable machine learning with non-parametric statistical validation, this study supports the use of linguistically grounded features for transparent and reliable language-based cognitive screening.
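
To make two of the methodological ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' code. It shows (a) one way to assign transcripts to folds so that no speaker appears in more than one fold, and (b) how Cliff's delta relates to the Mann–Whitney U statistic via delta = 2U / (n1 * n2) - 1. The function names, the greedy fold-balancing strategy, and the sample values are all illustrative assumptions; the paper does not specify how its folds were constructed.

```python
from collections import defaultdict


def subject_level_folds(subject_ids, n_folds=5):
    """Assign transcript indices to folds so that no subject spans two folds.

    Assumption: a simple greedy balance (largest subjects first, each placed
    in the currently smallest fold). The paper's exact procedure is not given.
    """
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    folds = [[] for _ in range(n_folds)]
    ordered = sorted(by_subject, key=lambda s: (-len(by_subject[s]), str(s)))
    for sid in ordered:
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_subject[sid])
    return folds


def mann_whitney_u(x, y):
    """U statistic for sample x: pairs with x_i > y_j count 1, ties count 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y), a value in [-1, 1]."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


# Hypothetical per-transcript lexical diversity scores (made-up numbers).
control_scores = [0.62, 0.58, 0.66, 0.60]
dementia_scores = [0.48, 0.55, 0.51, 0.60]

delta = cliffs_delta(dementia_scores, control_scores)
u_stat = mann_whitney_u(dementia_scores, control_scores)
delta_from_u = 2 * u_stat / (len(dementia_scores) * len(control_scores)) - 1
print(delta, delta_from_u)  # the two values agree
```

A negative delta here means the first group (dementia, in this toy example) tends to score lower than the second. In practice the significance test itself would come from a statistics library; the sketch only illustrates the effect size and the speaker-disjoint split.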

Paper Structure

This paper contains 24 sections, 2 equations, 14 figures, and 9 tables.

Figures (14)

  • Figure 1: Top logistic regression features for the raw cleaned text representation
  • Figure 2: Top random forest features for the raw cleaned text representation
  • Figure 3: Top logistic regression features for the POS-enhanced representation
  • Figure 4: Top random forest features for the POS-enhanced representation
  • Figure 5: Top logistic regression feature coefficients for the POS-only representation. Positive coefficients indicate features associated with dementia transcripts, while negative coefficients are associated with control transcripts.
  • ...and 9 more figures