Cross-Linguistic Persona-Driven Data Synthesis for Robust Multimodal Cognitive Decline Detection

Rui Feng; Zhiyao Luo; Liuyu Wu; Wei Wang; Yuting Song; Yong Liu; Kok Pin Ng; Jianqing Li; Xingyao Wang

TL;DR

This paper introduces SynCog, a framework that combines controllable zero-shot multimodal data synthesis with Chain-of-Thought deduction fine-tuning. It enables rapid expansion of clinical corpora across diverse languages, bypassing data bottlenecks in low-resource settings and improving the diagnostic performance of Multimodal Large Language Models (MLLMs).

Abstract

Speech-based digital biomarkers represent a scalable, non-invasive frontier for the early identification of Mild Cognitive Impairment (MCI). However, the development of robust diagnostic models remains impeded by acute clinical data scarcity and a lack of interpretable reasoning. Current solutions frequently struggle with cross-lingual generalization and fail to provide the transparent rationales essential for clinical trust. To address these barriers, we introduce SynCog, a novel framework integrating controllable zero-shot multimodal data synthesis with Chain-of-Thought (CoT) deduction fine-tuning. Specifically, SynCog simulates diverse virtual subjects with varying cognitive profiles to effectively alleviate clinical data scarcity. This generative paradigm enables the rapid, zero-shot expansion of clinical corpora across diverse languages, effectively bypassing data bottlenecks in low-resource settings and bolstering the diagnostic performance of Multimodal Large Language Models (MLLMs). Leveraging this synthesized dataset, we fine-tune a foundational multimodal backbone using a CoT deduction strategy, empowering the model to explicitly articulate diagnostic thought processes rather than relying on black-box predictions. Extensive experiments on the ADReSS and ADReSSo benchmarks demonstrate that augmenting limited clinical data with synthetic phenotypes yields competitive diagnostic performance, achieving Macro-F1 scores of 80.67% and 78.46%, respectively, outperforming current baseline models. Furthermore, evaluation on an independent real-world Mandarin cohort (CIR-E) demonstrates robust cross-linguistic generalization, attaining a Macro-F1 of 48.71%. These findings constitute a critical step toward providing clinically trustworthy and linguistically inclusive cognitive assessment tools for global healthcare.
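
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how Macro-F1 is conventionally computed: the per-class F1 scores are averaged with equal weight, regardless of class size. The function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard Macro-F1 definition)."""
    # Score every label that appears in either the references or the predictions.
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example (hypothetical labels, not data from the paper).
y_true = ["AD", "AD", "non-AD", "non-AD"]
y_pred = ["AD", "non-AD", "non-AD", "non-AD"]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally, Macro-F1 penalizes a model that performs well only on the majority class, which matters for small or imbalanced clinical cohorts.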

Paper Structure (2 sections, 11 equations, 7 figures, 6 tables, 1 algorithm)

Prompt Design
Model Card

Figures (7)

Figure 1: Overview of the proposed SynCog framework. The pipeline consists of four phases: (a) Data Synthesis, which involves simulating subjects via LLMs to generate high-fidelity multimodal narratives comprising both audio recordings and their corresponding transcripts, conditioned on demographic and cognitive attributes; (b) Chain-of-Thought Distillation, where diagnostic rationales are systematically derived to bridge raw multimodal evidence with clinical labels; (c) CoT Deduction Fine-Tuning, utilizing Low-Rank Adaptation (LoRA) to optimize the model $\mathcal{M}$ based on the diagnostic reasoning sequences derived during CoT Distillation; and (d) Diagnostic Inference, where the fine-tuned model $\mathcal{M}$ first articulates a heuristic reasoning sequence $r_j$ to ground the final assessment in specific pathological markers. By synthesizing evidence from acoustic prosody and linguistic content, this transparent inference pathway effectively mitigates the risk of shortcut learning. The resulting diagnostic process not only enhances the robustness of results across different linguistic contexts but also provides clinicians with interpretable evidentiary support to validate the final assessment.
Figure 2: Distribution of key linguistic biomarkers across synthetic cohorts generated by SynCog. Box plots quantify four representative features mapped to the assessment dimensions and scoring criteria: Total Word Count, Frequency of Spatial Terms, Frequency of Filler Words, and Frequency of Vague Terms. These metrics characterize the linguistic profiles of the generated cohorts across the diagnostic spectrum. (a) Synthetic English dataset contrasting patients with Alzheimer’s disease (AD) and non-AD individuals. (b) Synthetic Mandarin dataset across three diagnostic categories: healthy controls (HC), mild cognitive impairment (MCI), and AD. The distinct separation between neurodegenerative groups and healthy controls confirms that the generated text preserves pathological patterns consistent with the established scoring criteria and clinical diagnostic standards.
Figure 3: Distributional alignment of synthetic and clinical acoustic embeddings. The t-SNE visualization displays high-dimensional feature vectors extracted from speech samples using the wav2vec2-base-960h model. The plot reveals two distinct linguistic clusters where the synthetic data shares a substantial manifold overlap with real-world recordings. The English cluster comprises the ADReSS and ADReSSo clinical baselines aligned with the synthetic English dataset. The Mandarin cluster includes the CIR-E real-world cohort aligned with the synthetic Mandarin cohort. The dashed boundaries indicate the distributional extent of each subgroup and highlight the preservation of language-specific acoustic phenotypes by the SynCog framework.
Figure 4: Impact of data augmentation scaling on diagnostic performance. The line graph illustrates the Average F1 score trajectories for the ADReSS, ADReSSo, and CIR-E datasets as the augmentation ratio scales from zero to five times the baseline. The error bars represent the standard deviation across experimental runs. The observed trends demonstrate a rapid initial performance improvement followed by a plateau or slight decline, indicating the existence of an optimal threshold for synthetic data integration.
Figure S1: Prompt Template for Persona-Based Text Generation. The prompt conditions the text generation on specific demographic attributes and a discrete linguistic style vector. This mechanism enforces the production of narratives that reflect specific cognitive deficits rather than generic descriptions.
...and 2 more figures
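
The four linguistic features named in the Figure 2 caption can be approximated as simple token counts over a transcript. The sketch below is illustrative only: the term lists (`SPATIAL_TERMS`, `FILLER_WORDS`, `VAGUE_TERMS`), the function name `linguistic_features`, and the choice of raw counts over normalized rates are all assumptions, since the paper's actual lexicons and scoring criteria are not given here. It handles English only; Mandarin transcripts would need word segmentation first.

```python
import re

# Hypothetical lexicons for illustration; not the paper's term lists.
SPATIAL_TERMS = {"on", "under", "behind", "above", "beside", "left", "right"}
FILLER_WORDS = {"um", "uh", "er", "hmm"}
VAGUE_TERMS = {"thing", "stuff", "something", "someone", "somewhere"}


def linguistic_features(transcript):
    """Count the four features from Figure 2 in an English transcript."""
    # Lowercase and keep alphabetic tokens (apostrophes allowed, e.g. "it's").
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return {
        "total_words": len(tokens),
        "spatial_terms": sum(token in SPATIAL_TERMS for token in tokens),
        "filler_words": sum(token in FILLER_WORDS for token in tokens),
        "vague_terms": sum(token in VAGUE_TERMS for token in tokens),
    }


# Toy picture-description sentence (invented, not from any dataset).
features = linguistic_features(
    "Um, the boy is on the stool, uh, reaching for something."
)
```

Computing these counts per synthetic subject and grouping by diagnostic label would give the inputs for box plots of the kind the caption describes.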
