CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment

Rui Feng, Zhiyao Luo, Wei Wang, Yuting Song, Yong Liu, Tingting Zhu, Jianqing Li, Xingyao Wang

TL;DR

CogBench introduces the first cross-lingual, cross-site benchmark for speech-based cognitive impairment assessment, unifying English and Mandarin datasets with a new test set, CIR-E. By casting cognitive status inference as structured text generation for LLMs, the study systematically compares small speech models (SSMs) and multimodal LLMs under zero-shot prompting, prompting enhancements (CoT, EXP), and LoRA-based fine-tuning. Conventional SSMs generalize poorly across domains, while LLMs with chain-of-thought (CoT) prompting adapt better; LoRA fine-tuning significantly improves cross-domain generalization, especially on CIR-E. The work provides datasets, code, and evaluation scripts to promote clinically robust, linguistically aware cognitive screening tools with real-world applicability.

Abstract

Automatic assessment of cognitive impairment from spontaneous speech offers a promising, non-invasive avenue for early cognitive screening. However, current approaches often lack generalizability when deployed across different languages and clinical settings, limiting their practical utility. In this study, we propose CogBench, the first benchmark designed to evaluate the cross-lingual and cross-site generalizability of large language models (LLMs) for speech-based cognitive impairment assessment. Using a unified multimodal pipeline, we evaluate model performance on three speech datasets spanning English and Mandarin: ADReSSo, NCMMSC2021-AD, and a newly collected test set, CIR-E. Our results show that conventional deep learning models degrade substantially when transferred across domains. In contrast, LLMs equipped with chain-of-thought prompting demonstrate better adaptability, though their performance remains sensitive to prompt design. Furthermore, we explore lightweight fine-tuning of LLMs via Low-Rank Adaptation (LoRA), which significantly improves generalization in target domains. These findings offer a critical step toward building clinically useful and linguistically robust speech-based cognitive assessment tools.
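
For reference, the standard LoRA formulation keeps a pretrained weight matrix frozen and learns only a low-rank correction to it. The symbols below ($W_0$, $A$, $B$, $r$, $\alpha$, $d$, $k$) follow the general LoRA convention and are not taken from this paper, whose specific rank and adapted layers are not stated in this summary.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only $A$ and $B$ are trained, so each adapted matrix contributes $r(d + k)$ trainable parameters instead of $dk$, which is what makes the fine-tuning lightweight.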

Paper Structure

This paper contains 31 sections, 5 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: The overall workflow of our approach includes three key steps: (1) preprocessing the data with speaker diarization and ASR to produce clean audio and transcripts from multiple datasets; (2) applying designed prompts to LLMs for cognitive status inference from multimodal inputs; and (3) fine-tuning LLMs via Low-Rank Adaptation (LoRA) on formatted data, followed by evaluation of the fine-tuned models to obtain final predictions.
  • Figure 2: Comprehensive evaluation of LLMs on Avg@1 and Maj@5 metrics. (a) shows the overall performance of different LLMs averaged across three datasets. (b) shows detailed performance of each LLM on individual datasets.
  • Figure 3: Performance comparison of LLMs with and without LoRA against the best SSM across three datasets.
  • Figure 4: Maj@K as a function of the number of majority-voting samples K for three models under TTS.
  • Figure 5: Demographic distribution of participants in the CIR-E dataset. (a) Age distribution across diagnostic categories; no significant differences were observed between groups. (b) Gender distribution across categories. (c) Distribution of education levels, including illiterate, primary school, secondary school, and tertiary education (college or above). This figure illustrates the balance in key demographic variables, helping to control for potential confounding effects.
  • ...and 4 more figures
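
The Avg@1 and Maj@K metrics named in the captions above are not defined in this summary. A minimal sketch, assuming Avg@1 is the mean accuracy of individual sampled predictions and Maj@K is the accuracy of a majority vote over the first K samples per recording; the function names and the "AD"/"HC" labels are hypothetical and used only for illustration:

```python
from collections import Counter
from typing import Sequence


def majority_vote(preds: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label sampled first."""
    if not preds:
        raise ValueError("need at least one prediction to vote")
    counts = Counter(preds)
    best = max(counts.values())
    # Scan in sampling order so tie-breaking is deterministic.
    return next(p for p in preds if counts[p] == best)


def avg_at_1(samples: Sequence[Sequence[str]], labels: Sequence[str]) -> float:
    """Assumed definition: accuracy averaged over every single sampled prediction."""
    hits = sum(p == y for preds, y in zip(samples, labels) for p in preds)
    total = sum(len(preds) for preds in samples)
    return hits / total


def maj_at_k(samples: Sequence[Sequence[str]], labels: Sequence[str], k: int) -> float:
    """Assumed definition: accuracy of the majority vote over the first k samples."""
    votes = [majority_vote(preds[:k]) for preds in samples]
    return sum(v == y for v, y in zip(votes, labels)) / len(labels)


# Illustrative data: three recordings, three sampled predictions each.
samples = [["AD", "HC", "AD"], ["HC", "HC", "AD"], ["AD", "AD", "AD"]]
labels = ["AD", "AD", "HC"]
print(avg_at_1(samples, labels), maj_at_k(samples, labels, k=3))
```

Under these assumed definitions, voting can raise or lower accuracy relative to single samples depending on how consistent the model's errors are, which is why a curve of Maj@K against K is informative.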