Blockwise Principal Component Analysis for monotone missing data imputation and dimensionality reduction
Tu T. Do, Mai Anh Vu, Tuan L. Vo, Hoang Thien Ly, Thu Nguyen, Steven A. Hicks, Michael A. Riegler, Pål Halvorsen, Binh T. Nguyen
TL;DR
Monotone missing data poses a computational bottleneck when imputation and dimensionality reduction must both be performed. Blockwise PCA Imputation (BPI) applies PCA to the observed portion of each monotone block to obtain blockwise projections $\boldsymbol{z}_i$, stacks them into $\boldsymbol{z}^*$ with missing indicators, and then imputes this reduced representation with a chosen imputer. The approach is analyzed theoretically via eigenvalue interlacing, yielding bounds on the average explained variance $EV_q$ in terms of the block-specific $EV^{(i)}_{q_i}$ and the number of blocks $k$. It is validated experimentally on MNIST, Fashion MNIST, and PANCAN RNA-seq, showing substantial reductions in imputation time (52–88%) with minor accuracy trade-offs that depend on the classifier. The work demonstrates that BPI enables scalable, efficient handling of monotone missing data in high-dimensional settings, and it points to future extensions to categorical data and to possible denoising benefits from PCA.
Abstract
Monotone missing data is a common problem in data analysis. However, imputation combined with dimensionality reduction can be computationally expensive, especially as datasets grow in size. To address this issue, we propose a Blockwise principal component analysis Imputation (BPI) framework for dimensionality reduction and imputation of monotone missing data. The framework conducts Principal Component Analysis (PCA) on the observed part of each monotone block of the data, merges the resulting principal components, and then imputes the merged representation using a chosen imputation technique. BPI can work with various imputation techniques and can significantly reduce imputation time compared to conducting dimensionality reduction after imputation. This makes it a practical and efficient approach for large datasets with monotone missing data. Our experiments validate the improvement in speed. They also show that while applying MICE imputation directly to the missing data may fail to converge, applying BPI with MICE may lead to convergence.
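
To make the pipeline concrete, the following is a minimal sketch of the blockwise idea described above, not the authors' implementation. It assumes a monotone pattern in which block $i$ is observed on the first `n_obs[i]` rows only, with `n_obs` non-increasing. The names `pca_scores`, `bpi`, and `mean_impute` are hypothetical, the numbers of components in `q_list` are chosen arbitrarily, and column-mean imputation merely stands in for the chosen imputer (for example MICE).

```python
import numpy as np

def pca_scores(X, q):
    """Project the rows of X onto its top-q principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                        # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:q].T                           # (n_rows, q) score matrix

def bpi(blocks, n_obs, q_list, impute):
    """Blockwise PCA followed by imputation on the reduced representation.

    blocks[i] : (n, p_i) array; only its first n_obs[i] rows are observed.
    q_list[i] : number of components kept for block i (an assumption here).
    impute    : any function mapping an array with NaNs to a completed array.
    """
    n = blocks[0].shape[0]
    reduced = []
    for X, m, q in zip(blocks, n_obs, q_list):
        Z = np.full((n, q), np.nan)                # NaN marks missing rows
        Z[:m] = pca_scores(X[:m], q)               # PCA on the observed part only
        reduced.append(Z)
    Z_star = np.hstack(reduced)                    # merged blockwise projections
    return impute(Z_star)

def mean_impute(Z):
    """Placeholder imputer: fill each NaN with its column mean."""
    col_means = np.nanmean(Z, axis=0)
    return np.where(np.isnan(Z), col_means, Z)

# Synthetic monotone data: three blocks observed on 100, 70, and 40 rows.
rng = np.random.default_rng(0)
n, widths, n_obs, q_list = 100, [20, 15, 10], [100, 70, 40], [3, 2, 2]
blocks = []
for p, m in zip(widths, n_obs):
    X = rng.normal(size=(n, p))
    X[m:] = np.nan                                 # rows beyond m are missing
    blocks.append(X)

Z_hat = bpi(blocks, n_obs, q_list, mean_impute)
print(Z_hat.shape)                                 # (100, 7)
```

In this sketch the imputer only ever sees the 7 reduced columns rather than the 45 original features, which illustrates where the time savings reported in the abstract would come from; any speed or accuracy behavior on real data depends on the imputer and on how the numbers of components are chosen.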
