fastHDMI: Fast Mutual Information Estimation for High-Dimensional Data
Kai Yang, Masoud Asgharian, Nikhil Bhagwat, Jean-Baptiste Poline, Celia M. T. Greenwood
TL;DR
fastHDMI is a Python toolkit for variable screening in high-dimensional neuroimaging data. It implements three mutual information (MI) estimators (FFTKDE, kNN, and binning) alongside Pearson correlation as a baseline. Mutual information, defined as $I(X;Y)=D_{KL}(p(X,Y)\|p(X)p(Y))$, measures how far the joint distribution $p(X,Y)$ is from the product of its marginals, which makes it suitable for model-free screening that captures nonlinear dependencies. In extensive simulations on ABIDE-like high-dimensional data and in a case study predicting age and autism diagnosis, FFTKDE MI performs best for continuous nonlinear outcomes, binning MI for nonlinear binary outcomes, and Pearson correlation for linear cases (on par with FFTKDE MI when the outcome is continuous). Overall, fastHDMI is computationally efficient and robust, making it suitable for scalable neuroimaging analysis and ultra-high-dimensional variable selection.
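For reference, the KL-divergence definition above can be written out explicitly. The expansion below is the standard textbook form for continuous $X$ and $Y$. The second line is the generic plug-in estimate obtained by binning, where $n_{ij}$ is the number of samples in cell $(i,j)$ and $n$ is the sample size. This notation is introduced here for illustration and is not necessarily the exact estimator implemented in fastHDMI.

$$
I(X;Y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy \;\ge\; 0,
$$

$$
\hat{I}_{\text{bin}}(X;Y) = \sum_{i,j} \hat{p}_{ij}\,\log\frac{\hat{p}_{ij}}{\hat{p}_{i\cdot}\,\hat{p}_{\cdot j}}, \qquad \hat{p}_{ij} = \frac{n_{ij}}{n},\quad \hat{p}_{i\cdot} = \sum_j \hat{p}_{ij},\quad \hat{p}_{\cdot j} = \sum_i \hat{p}_{ij}.
$$

$I(X;Y)=0$ holds exactly when $X$ and $Y$ are independent, which is why a large estimate flags a variable as relevant regardless of the functional form of the dependence.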
Abstract
In this paper, we introduce fastHDMI, a Python package designed for efficient variable screening in high-dimensional datasets, particularly neuroimaging data. fastHDMI implements three mutual information estimation methods, and this work pioneers their application to neuroimaging variable selection. These methods improve our ability to analyze the complex structure of neuroimaging datasets and provide better tools for variable selection in high-dimensional spaces. Using the preprocessed ABIDE dataset, we evaluate the performance of these methods through extensive simulations. The simulations cover a range of conditions, including linear and nonlinear associations as well as continuous and binary outcomes. Our results show that FFTKDE-based mutual information estimation is the best screening method for continuous outcomes with nonlinear associations, while binning-based methods outperform the others for binary outcomes with nonlinear probability preimages. In the linear simulations, Pearson correlation and FFTKDE-based methods perform comparably for continuous outcomes, while Pearson correlation performs best for binary outcomes with linear probability preimages. A comprehensive case study using the ABIDE dataset further demonstrates fastHDMI's practical utility by showing the predictive power of models built from variables selected with our screening techniques. This research supports the computational efficiency and methodological strength of fastHDMI and adds a practical screening tool to the toolkit available for neuroimaging analysis.
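The screening idea described in the abstract can be illustrated with a minimal NumPy sketch: score every candidate variable against the outcome, then keep the top-ranked ones. This is not the fastHDMI API. The helper names `binned_mi` and `screen`, the choice of 16 bins, and the synthetic data are all assumptions made for illustration, and only a simple binning estimator and absolute Pearson correlation are shown.

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram.

    Hypothetical helper for illustration; not part of fastHDMI.
    """
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()              # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y, shape (1, bins)
    mask = pxy > 0                           # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def abs_pearson(x, y):
    """Absolute Pearson correlation, used as the linear baseline score."""
    return abs(np.corrcoef(x, y)[0, 1])

def screen(X, y, score, top_k):
    """Score each column of X against y; return top_k column indices and all scores."""
    scores = np.array([score(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]         # descending by score
    return order[:top_k], scores

# Synthetic data (assumed): only feature 0 matters, through a quadratic effect.
rng = np.random.default_rng(0)
n, p = 2000, 200
X = rng.standard_normal((n, p))
y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(n)

mi_top, mi_scores = screen(X, y, binned_mi, top_k=5)
pc_top, pc_scores = screen(X, y, abs_pearson, top_k=5)
print("top features by binned MI:", mi_top)
print("top features by |Pearson|:", pc_top)
```

In this setup the population Pearson correlation between feature 0 and the outcome is zero, because the dependence is symmetric and purely quadratic. A correlation-based ranking therefore has no systematic reason to favor it, whereas the MI score does. The binning estimator is biased upward in finite samples, so scores are best compared across variables rather than read as absolute values.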
