fastHDMI: Fast Mutual Information Estimation for High-Dimensional Data

Kai Yang, Masoud Asgharian, Nikhil Bhagwat, Jean-Baptiste Poline, Celia M. T. Greenwood

TL;DR

This paper addresses variable screening for high-dimensional neuroimaging data by introducing fastHDMI, a Python toolkit that implements three mutual information (MI) estimators (FFTKDE, kNN, and binning), with Pearson correlation as a baseline. Mutual information, defined as $I(X;Y)=D_{KL}(p(X,Y)\|p(X)p(Y))$, enables model-free screening that can detect nonlinear dependence in the joint distribution $p(X,Y)$. In extensive simulations on ABIDE-like high-dimensional data and in a case study predicting age and autism diagnosis, FFTKDE MI performs best for continuous nonlinear outcomes, binning MI for nonlinear binary outcomes, and Pearson correlation for linear cases. Collectively, the results indicate that fastHDMI is computationally efficient and robust for scalable neuroimaging analysis and ultra-high-dimensional variable selection.
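To make the definition concrete, the following is a minimal sketch of a binning-style MI estimate using only NumPy. It is not fastHDMI's API: the function name `binned_mutual_information`, the bin count, and the toy data are all assumptions made for illustration. It computes the plug-in KL divergence between the histogram estimate of $p(X,Y)$ and the product of its marginals.

```python
import numpy as np


def binned_mutual_information(x, y, bins=16):
    """Plug-in MI estimate (in nats) from a 2D histogram of x and y.

    Hypothetical helper for illustration; not part of fastHDMI.
    """
    joint_counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint_counts / joint_counts.sum()   # estimate of p(X, Y)
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal p(X), shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal p(Y), shape (1, bins)
    nonzero = p_xy > 0                         # skip empty cells (0 * log 0 = 0)
    ratio = p_xy[nonzero] / (p_x @ p_y)[nonzero]
    return float(np.sum(p_xy[nonzero] * np.log(ratio)))


# Toy data (assumed): y depends on x only through x**2.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y_nonlinear = x**2 + 0.1 * rng.normal(size=5000)
y_independent = rng.normal(size=5000)

mi_nonlinear = binned_mutual_information(x, y_nonlinear)
mi_independent = binned_mutual_information(x, y_independent)
pearson_nonlinear = np.corrcoef(x, y_nonlinear)[0, 1]

print(f"MI (nonlinear):   {mi_nonlinear:.3f}")
print(f"MI (independent): {mi_independent:.3f}")
print(f"Pearson (nonlinear): {pearson_nonlinear:.3f}")
```

In this toy setting the symmetric quadratic relationship gives a Pearson correlation near zero, while the MI estimate is clearly larger than for the independent pair. The plug-in estimate has a small positive bias under independence that grows with the number of bins.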

Abstract

In this paper, we introduce fastHDMI, a Python package designed for efficient variable screening in high-dimensional datasets, particularly neuroimaging data. This work pioneers the application of three mutual information estimation methods for neuroimaging variable selection, a novel approach implemented via fastHDMI. These advancements enhance our ability to analyze the complex structures of neuroimaging datasets, providing improved tools for variable selection in high-dimensional spaces. Using the preprocessed ABIDE dataset, we evaluate the performance of these methods through extensive simulations. The tests cover a range of conditions, including linear and nonlinear associations, as well as continuous and binary outcomes. Our results highlight the superiority of the FFTKDE-based mutual information estimation for feature screening in continuous nonlinear outcomes, while binning-based methods outperform others for binary outcomes with nonlinear probability preimages. For linear simulations, both Pearson correlation and FFTKDE-based methods show comparable performance for continuous outcomes, while Pearson excels in binary outcomes with linear probability preimages. A comprehensive case study using the ABIDE dataset further demonstrates fastHDMI's practical utility, showcasing the predictive power of models built from variables selected using our screening techniques. This research affirms the computational efficiency and methodological strength of fastHDMI, significantly enriching the toolkit available for neuroimaging analysis.
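The evaluation idea described in the abstract (rank covariates by an association measure, then score how well the ranking recovers the "true" covariates by AUROC) can be sketched as follows. This is a simplified stand-in under stated assumptions, not the paper's simulation design or fastHDMI's interface: the covariates are independent Gaussians rather than ABIDE features, the outcome is a sum of squares of three chosen covariates, and `binned_mi` and `selection_auroc` are hypothetical helper names.

```python
import numpy as np


def binned_mi(x, y, bins=8):
    """Plug-in mutual information estimate (nats) from a 2D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))


def selection_auroc(scores, is_true):
    """AUROC of ranking true covariates above noise covariates.

    Uses the Mann-Whitney rank-sum form and assumes no tied scores.
    """
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(is_true.sum())
    n_neg = len(is_true) - n_pos
    return (ranks[is_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Assumed toy design: 100 covariates, the first 3 drive a nonlinear outcome.
rng = np.random.default_rng(0)
n_samples, n_features, n_true = 2000, 100, 3
X = rng.normal(size=(n_samples, n_features))
is_true = np.zeros(n_features, dtype=bool)
is_true[:n_true] = True
y = (X[:, is_true] ** 2).sum(axis=1) + 0.5 * rng.normal(size=n_samples)

# Screening step: one marginal association score per covariate.
mi_scores = np.array([binned_mi(X[:, j], y) for j in range(n_features)])
pearson_scores = np.abs(
    [np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)]
)

auroc_mi = selection_auroc(mi_scores, is_true)
auroc_pearson = selection_auroc(pearson_scores, is_true)
print(f"Selection AUROC, binned MI: {auroc_mi:.3f}")
print(f"Selection AUROC, |Pearson|: {auroc_pearson:.3f}")
```

Each covariate is scored marginally against the outcome, so the screening cost grows linearly with the number of covariates and the loop parallelizes trivially. A single replication like this one is noisy, which is presumably why the paper reports means and confidence intervals over many replications.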

Paper Structure

This paper contains 9 sections, 13 equations, 5 figures.

Figures (5)

  • Figure 1: Variable selection AUROC on the simulated nonlinear continuous and original/translated binary outcomes; the horizontal axis is the number of “true” covariates used in the outcome simulation. Means with their $95\%$ confidence intervals are plotted for $100$ simulation replications.
  • Figure 2: Variable selection AUROC on the simulated linear continuous and original/translated binary outcomes; the horizontal axis is the number of “true” covariates used in the outcome simulation. Means with their $95\%$ confidence intervals are plotted for $100$ simulation replications.
  • Figure 3: Running times of variable screening for continuous (age) and binary (diagnosis) outcomes using the methods under study. The horizontal axis is the proportion of features entered into the screening phase; the vertical axis is the time in seconds to complete the screening. The plot shows mean running times and their $95\%$ confidence intervals (C.I.), derived from $5$ simulation replications.
  • Figure 4: Testing-set $R^{2}$ for the age-at-scan outcome vs. the number of most associated brain imaging covariates, selected by association measure ranking. Means with their $95\%$ confidence intervals are plotted for $20$ simulation replications.
  • Figure 5: Testing-set AUROC for the autism diagnosis outcome vs. the number of most associated brain imaging covariates, selected by association measure ranking. Means with their $95\%$ confidence intervals are plotted for $20$ simulation replications.