MINT: Multimodal Imaging-to-Speech Knowledge Transfer for Early Alzheimer's Screening

Vrushank Ahire; Yogesh Kumar; Anouck Girard; M. A. Ganaie

TL;DR

To the authors' knowledge, this is the first demonstration of MRI-to-speech knowledge transfer for early Alzheimer's screening, establishing a biologically grounded pathway for population-level cognitive triage without neuroimaging at inference.

Abstract

Alzheimer's disease is a progressive neurodegenerative disorder in which mild cognitive impairment (MCI) marks a critical transition between normal aging and dementia. Neuroimaging modalities such as structural MRI provide biomarkers of this transition; however, their high cost and infrastructure needs limit deployment at population scale. Speech analysis offers a non-invasive alternative, but speech-only classifiers are developed independently of neuroimaging, leaving their decision boundaries biologically ungrounded and limiting reliability on the subtle distinction between cognitively normal (CN) and MCI subjects. We propose MINT (Multimodal Imaging-to-Speech Knowledge Transfer), a three-stage cross-modal framework that transfers biomarker structure from MRI into a speech encoder at training time. An MRI teacher, trained on 1,228 subjects, defines a compact neuroimaging embedding space for CN-versus-MCI classification. A residual projection head aligns speech representations to this frozen imaging manifold via a combined geometric loss, adapting speech to the learned biomarker space while preserving imaging encoder fidelity. At inference, the frozen MRI classifier, which is never exposed to speech during training, is applied to the aligned speech embeddings, so no scanner is required. Evaluation on ADNI-4 shows that aligned speech achieves performance comparable to speech-only baselines (AUC 0.720 vs 0.711) while requiring no imaging at inference, demonstrating that MRI-derived decision boundaries can ground speech representations. Multimodal fusion improves over MRI alone (0.973 vs 0.958). Ablation studies identify dropout regularization and self-supervised pretraining as critical design decisions. To our knowledge, this is the first demonstration of MRI-to-speech knowledge transfer for early Alzheimer's screening, establishing a biologically grounded pathway for population-level cognitive triage without neuroimaging at inference.
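
The alignment step described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the residual head is taken to be linear (`z = s + W s`), the "combined geometric loss" is assumed to be mean squared error plus a weighted cosine distance, the embeddings are 4-dimensional rather than 128-dimensional, the paired data are synthetic, and the frozen classifier parameters (`clf_w`, `clf_b`) are hypothetical. All function names are likewise hypothetical.

```python
import math
import random

DIM = 4  # toy size for illustration; Figure 1 describes a 128-dimensional MRI space


def residual_project(speech_vec, weights):
    """Assumed residual head: returns s + W s for a square weight matrix W."""
    return [
        s + sum(w * x for w, x in zip(row, speech_vec))
        for s, row in zip(speech_vec, weights)
    ]


def alignment_loss(projected, mri_target, cos_weight=0.5):
    """Assumed 'geometric' loss: MSE plus a weighted cosine distance."""
    mse = sum((p - t) ** 2 for p, t in zip(projected, mri_target)) / len(projected)
    dot = sum(p * t for p, t in zip(projected, mri_target))
    norm = math.sqrt(sum(p * p for p in projected)) * math.sqrt(
        sum(t * t for t in mri_target)
    )
    return mse + cos_weight * (1.0 - dot / (norm + 1e-12))


def mean_loss(weights, pairs):
    """Average alignment loss over (speech, MRI) embedding pairs."""
    total = sum(alignment_loss(residual_project(s, weights), m) for s, m in pairs)
    return total / len(pairs)


def train_head(pairs, steps=200, lr=0.3, eps=1e-5):
    """Fit W by finite-difference gradient descent (toy optimiser).

    Only the head is updated; MRI targets and the classifier stay frozen.
    """
    weights = [[0.0] * DIM for _ in range(DIM)]  # zero init: head starts as identity
    for _ in range(steps):
        base = mean_loss(weights, pairs)
        grads = [[0.0] * DIM for _ in range(DIM)]
        for i in range(DIM):
            for j in range(DIM):
                weights[i][j] += eps
                grads[i][j] = (mean_loss(weights, pairs) - base) / eps
                weights[i][j] -= eps
        for i in range(DIM):
            for j in range(DIM):
                weights[i][j] -= lr * grads[i][j]
    return weights


def frozen_mri_classifier(embedding, clf_weights, clf_bias):
    """Logistic classifier assumed to be fitted on MRI embeddings only."""
    logit = sum(w * e for w, e in zip(clf_weights, embedding)) + clf_bias
    return 1.0 / (1.0 + math.exp(-logit))


def demo(seed=0):
    """Train the head on synthetic pairs, then classify one speech embedding."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(8):
        mri = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        speech = [0.5 * m + rng.gauss(0.0, 0.1) for m in mri]  # synthetic modality gap
        pairs.append((speech, mri))
    before = mean_loss([[0.0] * DIM for _ in range(DIM)], pairs)
    weights = train_head(pairs)
    after = mean_loss(weights, pairs)
    clf_w, clf_b = [0.8, -0.5, 0.3, 0.1], 0.0  # hypothetical frozen parameters
    aligned = residual_project(pairs[0][0], weights)
    return before, after, frozen_mri_classifier(aligned, clf_w, clf_b)
```

Calling `demo()` returns the alignment loss before and after training together with an MCI probability computed from a speech embedding alone. The point of the sketch is the division of labour: only the projection head is trained, while the MRI-side targets and classifier are left untouched and reused at inference.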

Paper Structure (11 sections, 3 equations, 2 figures, 2 tables)

Introduction
Methodology
Problem Setup and Notation
Stage 1: Speech Encoder Pretraining and Fine-tuning
Stage 2: MRI Feature Extraction and Teacher Training
Stage 3: Cross-Modal Alignment
Experiments and Results
Dataset and Unified Evaluation Protocol
Main Results
Ablation Studies
Discussion and Conclusion

Figures (2)

Figure 1: Overview of the three-stage MINT framework. In Stage 1, a speech encoder is pretrained using masked autoencoding and then fine-tuned for CN–MCI classification. In Stage 2, an MRI teacher is trained via tissue-stratified feature extraction to learn a 128-dimensional biomarker embedding space. In Stage 3, a projection head aligns speech embeddings to the frozen MRI space, enabling speech-only inference as well as multimodal fusion.
Figure 2: PCA visualization shows higher class overlap in speech embeddings, whereas MRI embeddings display better separation between CN (blue) and MCI (red). This modality gap motivates the cross-modal alignment objective.
