Variable Selection Methods for Multivariate, Functional, and Complex Biomedical Data in the AI Age

Marcos Matabuena

TL;DR

This work proposes new optimization-based variable selection methods, built on best-subset selection, for multivariate, functional, and even more general outcomes in metric spaces, and demonstrates that the proposed methodology outperforms state-of-the-art methods in accuracy and speed.

Abstract

Many problems within personalized medicine and digital health rely on the analysis of continuous-time functional biomarkers and other complex data structures emerging from high-resolution patient monitoring. In this context, this work proposes new optimization-based variable selection methods, built on best-subset selection, for multivariate, functional, and even more general outcomes in metric spaces. Our framework applies to several types of regression models, including linear, quantile, and nonparametric additive models, and to a broad range of random responses, such as univariate and multivariate Euclidean data, functional data, and even random graphs. Our analysis demonstrates that the proposed methodology outperforms state-of-the-art methods in accuracy and, especially, in speed, achieving improvements of several orders of magnitude over competitors across various types of statistical responses, such as mathematical functions. While our framework is general and is not designed for a specific regression or scientific problem, the article is self-contained and focuses on biomedical applications. In clinical areas, it serves as a valuable resource for professionals in biostatistics, statistics, and artificial intelligence interested in the variable selection problem in this new technological AI era.

Paper Structure (35 sections, 7 theorems, 44 equations, 5 figures, 5 tables)

Key Result

Theorem 1

For any convex loss functions $\ell_t$, and assuming an additive linear structure across the loss functions $\ell_t$, $t\in [m]$, the optimization problem (eqn:generic.ss) is equivalent to [...], where $\hat{\ell}(y,a):= \max_{u\in \mathbb{R}} \{u a-\ell(y,u)\}$ is a convex function known as the Fenchel conjugate of $\ell$ (bauschke2012fenchel). In particular, the function $f$ is continuous, linear in $s$, and conc
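To make the Fenchel conjugate in Theorem 1 concrete, the following minimal sketch evaluates $\hat{\ell}(y,a)=\max_{u}\{ua-\ell(y,u)\}$ for one specific loss. The squared loss $\ell(y,u)=\tfrac12(y-u)^2$ is an assumption chosen for illustration and is not taken from the paper. For it, the maximizer is $u^\ast=y+a$ and the conjugate is $\hat{\ell}(y,a)=ay+\tfrac12 a^2$. The function names and the grid are hypothetical and are not part of the paper's method or code.

```python
import numpy as np

def squared_loss(y, u):
    # Illustrative convex loss (an assumption, not the paper's choice).
    return 0.5 * (y - u) ** 2

def conjugate_numeric(loss, y, a, grid):
    # Approximate max_u {u * a - loss(y, u)} by brute force over a finite grid of u.
    return np.max(grid * a - loss(y, grid))

def conjugate_squared_closed_form(y, a):
    # Setting the derivative a + (y - u) to zero gives u* = y + a,
    # so the maximum value is a * y + a**2 / 2.
    return a * y + 0.5 * a ** 2

grid = np.linspace(-50.0, 50.0, 200001)  # candidate values of u
y, a = 1.3, -0.7
numeric = conjugate_numeric(squared_loss, y, a, grid)
exact = conjugate_squared_closed_form(y, a)
print(numeric, exact)
```

The grid search only confirms the closed form numerically. For losses without a closed-form conjugate, such as some robust losses, the same brute-force evaluation could serve as a rough check, with an accuracy limited by the grid resolution.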

Figures (5)

  • Figure 1: Variation in the mean and standard deviation of glucose values for a diabetic individual depending on the day of the week and time of day.
  • Figure 2: Average glucose trajectories (left) and standard deviation trajectories (right).
  • Figure 3: Left: Raw CGM time series of two individuals. Center: The corresponding density functions. Right: The corresponding quantile representation.
  • Figure 4: Raw Quantile Outcomes
  • Figure 5: P-values across the temporal domain for the statistical significance of each selected variable.

Theorems & Definitions (17)

  • Remark 1
  • Theorem 1
  • Theorem 1
  • Remark 2
  • Definition 1
  • Theorem 2: Schoenberg (1937, 1938) (schoenberg1937certain, schoenberg1938metric)
  • Example 1: Laplacian graph
  • Example 2: The 2-Wasserstein Distance in the univariate case
  • Remark 3
  • Proposition 3
  • ...and 7 more
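
Example 2 in the list above concerns the 2-Wasserstein distance in the univariate case. There the distance has the standard closed form $W_2^2(F,G)=\int_0^1 \left(F^{-1}(t)-G^{-1}(t)\right)^2\,dt$, so distributional outcomes can be compared through their quantile functions. The sketch below approximates this integral from two samples. The function name, the grid size, and the synthetic data are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def wasserstein2_from_samples(x, y, n_grid=1000):
    # Approximate the univariate 2-Wasserstein distance by comparing
    # empirical quantile functions on a midpoint grid of (0, 1).
    t = (np.arange(n_grid) + 0.5) / n_grid
    qx = np.quantile(x, t)
    qy = np.quantile(y, t)
    # Midpoint-rule approximation of the integral of (F^{-1} - G^{-1})^2.
    return np.sqrt(np.mean((qx - qy) ** 2))

rng = np.random.default_rng(0)
sample = rng.normal(size=500)   # synthetic data, for illustration only
shifted = sample + 2.0          # same shape, shifted by a constant

w_same = wasserstein2_from_samples(sample, sample)
w_shift = wasserstein2_from_samples(sample, shifted)
print(w_same, w_shift)
```

Shifting a sample by a constant shifts every quantile by that constant, so the distance between `sample` and `shifted` equals the size of the shift, which gives a simple sanity check on the approximation.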