AdapDISCOM: An Adaptive Sparse Regression Method for High-Dimensional Multimodal Data With Block-Wise Missingness and Measurement Errors

Maimouna Baldé, Abdoul O. Diakité, Claudia Moreau, Gleb Bezgin, Nikhil Bhagwat, Pedro Rosa-Neto, Jean-Baptiste Poline, Simon Girard, Amadou Barry

TL;DR

AdapDISCOM presents an adaptive sparse regression framework that jointly addresses block-wise missingness and additive measurement error in high-dimensional multimodal data. By deriving modality-specific covariance weighting and providing robust (Huber) and fast variants, it achieves superior prediction and reliable biomarker selection across heterogeneous settings, as demonstrated in extensive simulations and ADNI data. The work offers strong theoretical guarantees, including convergence, model selection consistency, and efficient prediction, while delivering scalable software for practical use in biomedical research. This approach enhances the reliability of multimodal analyses in the presence of realistic data imperfections, enabling more accurate inference and biomarker discovery.

Abstract

Multimodal high-dimensional data are increasingly prevalent in biomedical research, yet they are often compromised by block-wise missingness and measurement errors, posing significant challenges for statistical inference and prediction. We propose AdapDISCOM, a novel adaptive direct sparse regression method that simultaneously addresses these two pervasive issues. Building on the DISCOM framework, AdapDISCOM introduces modality-specific weighting schemes to account for heterogeneity in data structures and error magnitudes across modalities. We establish the theoretical properties of AdapDISCOM, including model selection consistency and convergence rates under sub-Gaussian and heavy-tailed settings, and develop robust and computationally efficient variants (AdapDISCOM-Huber and Fast-AdapDISCOM). Extensive simulations demonstrate that AdapDISCOM consistently outperforms existing methods such as DISCOM, SCOM, and CoCoLasso, particularly under heterogeneous contamination and heavy-tailed distributions. Finally, we apply AdapDISCOM to Alzheimer's Disease Neuroimaging Initiative (ADNI) data, demonstrating improved prediction of cognitive scores and reliable selection of established biomarkers, even with substantial missingness and measurement errors. AdapDISCOM provides a flexible, robust, and scalable framework for high-dimensional multimodal data analysis under realistic data imperfections.
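To make the idea of modality-specific covariance weighting concrete, the following is a minimal sketch, not the authors' implementation. It assumes missing blocks are coded as NaN, estimates each covariance entry from the samples that observe both features, and then rescales within-modality blocks, cross-modality blocks, and an identity term with separate weights. The function names `pairwise_covariance` and `combine_blocks`, the argument names, and the exact form of the combination are illustrative assumptions; the weights are taken as given rather than tuned, and additive measurement error is not modeled.

```python
import numpy as np

def pairwise_covariance(X):
    """Covariance from pairwise-complete observations; NaN marks a missing entry."""
    obs = ~np.isnan(X)                        # True where a value is observed
    Xc = X - np.nanmean(X, axis=0)            # center each column on its observed mean
    Xc = np.where(obs, Xc, 0.0)               # missing entries contribute zero to sums
    counts = obs.T.astype(float) @ obs.astype(float)  # samples observing both features
    with np.errstate(invalid="ignore", divide="ignore"):
        S = (Xc.T @ Xc) / counts
    return np.where(counts > 0, S, 0.0)       # pairs never observed together are set to 0

def combine_blocks(S, modality, alpha_k, alpha_c, alpha_p):
    """Hypothetical weighting: alpha_k[m] on the within-modality block of modality m,
    alpha_c on all cross-modality blocks, alpha_p on the identity matrix."""
    modality = np.asarray(modality)
    same = modality[:, None] == modality[None, :]      # True inside within-modality blocks
    within_w = np.asarray(alpha_k, dtype=float)[modality][:, None]
    W = np.where(same, within_w, alpha_c)              # entrywise weight matrix
    return W * S + alpha_p * np.eye(S.shape[0])
```

Separate weights per modality let a noisier or more sparsely observed modality be shrunk more strongly than a cleaner one, which is the heterogeneity the abstract refers to; how the weights are chosen is left open here.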

Paper Structure

This paper contains 25 sections, 5 theorems, 67 equations, 4 figures.

Key Result

Proposition 1

Consider the following optimization problem, where the weights $\alpha_k$, $k=1,2,\ldots,K$, $\alpha_C$, and $\alpha_p$ are nonrandom. Denote, for $k=1,2,\ldots,K$, $\delta_{C}^2 = \mathbb{E}[\|\widetilde{\Sigma}_{C} - \Sigma_{C}\|^2_F]$, $\delta_{I_k}^2 = \mathbb{E}[\|\widetilde{\Sigma}_{\ldots}$ ... The optimal weights are, for $k=1,2,\ldots,K$, ... In add...

Figures (4)

  • Figure 1: Mean squared error (MSE) of the different methods across the first six scenarios and varying levels of measurement error variance with $n=400.$ In Scenario IV, which involves only measurement error, imputation-based methods are excluded and SCOM is equivalent to LASSO. Results for LASSO and other baseline methods are provided in the supplementary material to improve the readability and highlight the performance of our proposed methods.
  • Figure 2: F1-score of the different methods across the first six scenarios and varying levels of measurement error variance with $n=400.$ In Scenario IV, which involves only measurement error, imputation-based methods are excluded and SCOM is equivalent to LASSO. Results for LASSO and other baseline methods are provided in the supplementary material to improve the readability and highlight the performance of our proposed methods.
  • Figure 3: Mean squared error (MSE) and $R^2$ of the different methods on the test set, presented as bar plots with standard deviation error bars. Results for Scenario I (CSF + MRI + PET) are displayed in the top row, and results for Scenario II (CSF + MRI + PET + SNPs) in the bottom row.
  • Figure 4: Average effect sizes (colour gradient) of features selected at least 50% (square), 75% (triangle), or 100% (circle) of the time by a given method.
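The quantities reported in these figures can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's evaluation code. It assumes the F1-score measures support recovery (which coefficients are selected as nonzero) and that MSE is a mean squared difference between two vectors; the captions do not say whether the simulation MSE is prediction or estimation error. The function names and the tolerance `tol` are hypothetical.

```python
import numpy as np

def support_f1(beta_true, beta_hat, tol=1e-8):
    """F1-score for variable selection: nonzero coefficients count as selected."""
    true_sel = np.abs(np.asarray(beta_true)) > tol
    est_sel = np.abs(np.asarray(beta_hat)) > tol
    tp = np.sum(true_sel & est_sel)      # relevant features that were selected
    fp = np.sum(~true_sel & est_sel)     # irrelevant features that were selected
    fn = np.sum(true_sel & ~est_sel)     # relevant features that were missed
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def mse(target, estimate):
    """Mean squared difference between two equally shaped arrays."""
    target, estimate = np.asarray(target, float), np.asarray(estimate, float)
    return float(np.mean((target - estimate) ** 2))

def selection_frequency(selected):
    """Fraction of runs in which each feature is selected.
    `selected` is a (n_runs, n_features) boolean array."""
    return np.asarray(selected, dtype=float).mean(axis=0)
```

Thresholding the output of `selection_frequency` at 0.5, 0.75, and 1.0 would give groupings like the 50%, 75%, and 100% categories described in the Figure 4 caption.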

Theorems & Definitions (12)

  • Proposition 1
  • Theorem 1
  • Remark
  • Theorem 2
  • Remark
  • Theorem 3
  • Remark
  • Proof of Proposition 1
  • Proof of Theorem 1
  • Lemma 1
  • ...and 2 more