OPTIMUS: Predicting Multivariate Outcomes in Alzheimer's Disease Using Multi-modal Data amidst Missing Values

Christelle Schneuwly Diaz, Duy-Thanh Vu, Julien Bodelet, Duy-Cat Can, Guillaume Blanc, Haiting Jiang, Lin Yao, Guiseppe Pantaleo, ADNI, Oliver Y. Chén

TL;DR

OPTIMUS tackles the many-to-many challenge of predicting multivariate Alzheimer's disease outcomes from multimodal data with missing values. It integrates modality-specific missing-data imputation, a TabNet-based multivariate predictor, and permutation-based explainability to map biomarkers to four cognitive domains. The framework identifies differential, biologically meaningful biomarkers across MRI, CSF, and transcriptomic data, with APOE $\epsilon$4 repeatedly emerging as a key predictor and imaging features localized to anatomically plausible regions. The results demonstrate improved predictive accuracy with multimodal data over any single modality and showcase the potential for interpretable, mechanistic insight into AD progression that could inform clinical decision-making.

Abstract

Alzheimer's disease (AD), a neurodegenerative disorder, is associated with neural, genetic, and proteomic factors and affects multiple cognitive and behavioral faculties. Traditional AD prediction largely focuses on univariate disease outcomes, such as disease stages and severity. Multimodal data encode broader disease information than a single modality and may, therefore, improve disease prediction; but they often contain missing values. Recent "deeper" machine learning approaches show promise in improving prediction accuracy, yet the biological relevance of these models needs to be further charted. Integrating missing data analysis, predictive modeling, multimodal data analysis, and explainable AI (XAI), we propose OPTIMUS, a predictive, modular, and explainable machine learning framework, to unveil the many-to-many predictive pathways between multimodal input data and multivariate disease outcomes amidst missing values. OPTIMUS first applies modality-specific imputation to recover missing data in each modality while optimizing overall prediction accuracy. It then maps multimodal biomarkers to multivariate outcomes using machine learning and extracts the biomarkers predictive of each outcome. Finally, OPTIMUS incorporates XAI to explain the identified multimodal biomarkers. Using data from 346 cognitively normal subjects, 608 persons with mild cognitive impairment, and 251 AD patients, OPTIMUS identifies neural and transcriptomic signatures that jointly but differentially predict multivariate outcomes related to executive function, language, memory, and visuospatial function. Our work demonstrates the potential of building a predictive and biologically explainable machine learning framework to uncover multimodal biomarkers that capture disease profiles across varying cognitive landscapes. The results improve our understanding of the complex many-to-many pathways in AD.
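
The step of extracting biomarkers predictive of each outcome can be illustrated with a per-outcome permutation importance, the explainability measure named in the TL;DR. The following is a minimal sketch and not the authors' implementation: the function names, the toy linear predictor standing in for the TabNet model, the synthetic data, and the use of mean squared error as the score are all assumptions made for illustration.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error for a single outcome."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, Y, n_repeats=10, seed=0):
    """Return importance[j][k]: mean increase in MSE of outcome k when feature j is shuffled.

    predict: callable mapping a list of feature rows to a list of outcome rows.
    X: list of feature rows; Y: list of outcome rows (same length as X).
    """
    rng = random.Random(seed)
    n_features, n_outcomes = len(X[0]), len(Y[0])
    base_pred = predict(X)
    baseline = [mse([y[k] for y in Y], [p[k] for p in base_pred])
                for k in range(n_outcomes)]
    importance = [[0.0] * n_outcomes for _ in range(n_features)]
    for j in range(n_features):
        for _ in range(n_repeats):
            # Shuffle one feature column, leaving all other columns intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            pred = predict(X_perm)
            for k in range(n_outcomes):
                err = mse([y[k] for y in Y], [p[k] for p in pred])
                importance[j][k] += (err - baseline[k]) / n_repeats
    return importance

# Hypothetical stand-in model: outcome 0 depends only on feature 0,
# outcome 1 only on feature 1, and feature 2 is never used.
def toy_predict(rows):
    return [[2.0 * r[0], -3.0 * r[1]] for r in rows]

data_rng = random.Random(42)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
Y = toy_predict(X)  # noiseless outcomes, so the baseline error is zero
scores = permutation_importance(toy_predict, X, Y)
```

In this toy setting each feature receives a positive score only for the outcome it drives, which is the sense in which one set of inputs can predict several outcomes "jointly but differentially".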

Paper Structure

This paper contains 35 sections, 15 equations, 18 figures, 12 tables.

Figures (18)

  • Figure 1: Schematic representation of the OPTIMUS architecture, from left to right. (1) Neuroimaging (structural MRI), genetic (APOE genotype), blood transcriptomic (RNA sequencing, or RNA-seq), proteomic (CSF data including phosphorylated tau (p-tau), total tau (t-tau), and amyloid beta (A$\beta$)), and demographic data undergo pre-processing and feature extraction, and enter the OPTIMUS model. (2) OPTIMUS performs modality-specific missing data imputation. It generates imputed data for each modality whose modality-specific distributions resemble those of the observed data. (3) OPTIMUS performs many-to-many prediction. It predicts multivariate outcomes using multimodal data. (4) OPTIMUS explains the selected biomarkers via explainable AI (XAI). It numerically quantifies the explainability of the selected features using permutation importance scores and, by plotting the weights of the top features back to the anatomical space, assesses their clinical and pathological relevance.
  • Figure 2: Intra- and inter-modality feature similarity and their relationship to the multivariate outcomes. Multimodal data show strong intra-modality similarity but weaker inter-modality similarity. This suggests that features from different modalities potentially offer complementary information. Concurrently, each modality contains features strongly associated with the outcomes; but for each outcome, the most relevant features vary within and across modalities. This suggests that there are potential multimodal biomarkers differentially predictive of the outcomes. Each entry of the matrix represents the Pearson correlation between paired features, or that between a feature and an outcome. The colors of the rows and columns correspond to data modalities: MRI (cortical thickness) in yellow, RNA (blood transcriptomics) in orange, CSF in blue, DNA in red, and cognitive domain scores in pink. Variables are: 200 brain regions of interest (MRI), 54 gene counts (RNA), 3 protein quantifications (CSF), 3 allele counts (DNA), and 4 cognitive domain scores. Hierarchical clustering was performed modality-wise. A further sub-clustering of the MRI data based on functional brain regions is in Fig. \ref{fig:mri_correlation}.
  • Figure 3: Inter-group feature and outcome differences. (a) Cortical thickness by functional brain network across the CN, MCI, and AD groups. Cortical thickness generally decreases as AD progresses, but differentially across brain regions. (b) CSF biomarkers across the CN, MCI, and AD groups. As AD advances, p-tau and t-tau increase and A$\beta$ decreases. (c) Genes differentially expressed between CN and AD. Genes differentially expressed between CN and MCI and between MCI and AD are in Fig. \ref{fig:results_dge}. (d) Multivariate cognitive scores across the CN, MCI, and AD groups. As AD progresses, scores related to memory, executive function, visuospatial function, and language worsen.
  • Figure 4: Multimodal missing data analysis. (a) Percentage of available data across modalities. (b) Modality-specific missing data imputation. The first row contains histograms and kernel density estimation (KDE) curves for A$\beta$, phosphorylated tau (p-tau), and t-tau from CSF, and AGR1, CD300E, and CLU gene counts from blood RNA-seq expression data. The first column corresponds to observed data; each subsequent column represents distributions from a specific imputer (see Table \ref{tab:imputation}), ranked by the KL-distance between the observed and imputed distributions. Within each subplot, the x-axis represents the feature values and the y-axis shows frequency; the color codes for the distributions are: red = AD, orange = MCI, and blue = CN.
  • Figure 5: Interpreting multimodal biomarkers predictive of multivariate outcomes via explainable AI (XAI). Top panel. The scatter plots from left to right show predicted scores (y-axis) against observed scores (x-axis) for executive function, language, memory, and visuospatial function, respectively. Blue = CN; yellow = MCI; red = AD. Each line indicates the goodness of fit, and the shaded band indicates the 95% confidence interval. Bottom panel. Each bar chart shows the top 20 features with the highest feature importance scores, derived from a permutation test (averaged over 10 iterations of feature shuffling), for predicting executive function, language, memory, and visuospatial function, respectively. Bars are colored according to their modality (MRI, cortical thickness: yellow; DNA, APOE genotype: red; RNA, blood transcriptomics: orange). The top neuroimaging features were further projected onto the Schaefer Atlas (200 regions across 7 networks) in four views.
  • ...and 13 more figures
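
The caption of Figure 4 describes ranking imputers by the KL-distance between observed and imputed distributions. The sketch below shows one way such a ranking could be computed for a single feature. It is an illustration under stated assumptions, not the paper's procedure: the two candidate imputers (mean and hot-deck) are hypothetical stand-ins rather than the imputers listed in the paper's table, and the histogram binning, the smoothing constant, and the direction KL(observed || imputed) are choices made here.

```python
import math
import random

def histogram(values, lo, hi, n_bins=10, eps=1e-6):
    """Smoothed, normalized histogram on [lo, hi]; eps avoids log(0) in the KL term."""
    counts = [eps] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        b = min(max(int((v - lo) / width), 0), n_bins - 1)  # clamp into edge bins
        counts[b] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions defined on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def rank_imputers(observed, imputers, n_missing):
    """Return (kl, name) pairs sorted from closest to farthest from the observed distribution."""
    lo, hi = min(observed), max(observed)
    p = histogram(observed, lo, hi)
    scored = [(kl_divergence(p, histogram(impute(observed, n_missing), lo, hi)), name)
              for name, impute in imputers.items()]
    return sorted(scored)

rng = random.Random(0)

def mean_imputer(observed, n):
    """Hypothetical imputer: fill every missing entry with the observed mean."""
    return [sum(observed) / len(observed)] * n

def hot_deck_imputer(observed, n):
    """Hypothetical imputer: fill missing entries by resampling observed values."""
    return rng.choices(observed, k=n)

# Synthetic stand-in for one biomarker; no real cohort data are used.
observed = [rng.gauss(0.0, 1.0) for _ in range(500)]
ranking = rank_imputers(
    observed,
    {"mean": mean_imputer, "hot_deck": hot_deck_imputer},
    n_missing=300,
)
```

Mean imputation concentrates all imputed values in one bin, so it ranks far behind the resampling imputer, whose imputed values follow the observed distribution by construction.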