Blockwise Principal Component Analysis for monotone missing data imputation and dimensionality reduction

Tu T. Do; Mai Anh Vu; Tuan L. Vo; Hoang Thien Ly; Thu Nguyen; Steven A. Hicks; Michael A. Riegler; Pål Halvorsen; Binh T. Nguyen

TL;DR

Combining imputation with dimensionality reduction is a computational bottleneck for large datasets with monotone missing data. Blockwise PCA Imputation (BPI) applies PCA to the observed portion of each monotone block to obtain blockwise projections $\boldsymbol{z}_i$, stacks them into $\boldsymbol{z}^*$ with the missing entries marked, and then imputes on this reduced representation with a chosen imputer. The approach is analyzed theoretically via eigenvalue interlacing, which yields bounds on the average explained variance $EV_q$ in terms of the block-specific $EV^{(i)}_{q_i}$ and the number of blocks $k$. Experiments on MNIST, Fashion MNIST, and PANCAN RNA-seq show imputation-time reductions of 52–88%, with minor accuracy trade-offs that depend on the classifier. The results indicate that BPI scales to monotone missing data in high-dimensional settings; the authors suggest extending it to categorical data and investigating possible denoising benefits from PCA as future work.
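For readers unfamiliar with the term: a missingness pattern is usually called monotone when the columns can be ordered so that, within each row, once an entry is missing every later entry is missing too. A minimal sketch of that check follows, assuming NumPy, NaN-coded missing values, and columns already in a suitable order. The helper name `is_monotone` is hypothetical and does not come from the paper.

```python
import numpy as np

def is_monotone(X):
    """Return True if no row has an observed entry after a missing one."""
    missing = np.isnan(X)
    # Monotone means: if column j is missing in a row, column j+1 is missing too.
    return bool(np.all(missing[:, :-1] <= missing[:, 1:]))

# Rows lose trailing columns only, so the pattern is monotone.
monotone = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, np.nan],
    [6.0, np.nan, np.nan],
])

# The second row has an observed value after a missing one.
not_monotone = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 5.0],
])

print(is_monotone(monotone), is_monotone(not_monotone))
```

The check only inspects the given column order; finding an order that makes a pattern monotone is a separate step that this sketch does not attempt.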

Abstract

Monotone missing data is a common problem in data analysis. Combining imputation with dimensionality reduction, however, can be computationally expensive, especially as datasets grow. To address this issue, we propose a Blockwise principal component analysis Imputation (BPI) framework for dimensionality reduction and imputation of monotone missing data. The framework conducts Principal Component Analysis (PCA) on the observed part of each monotone block of the data, merges the resulting principal components, and then imputes the merged components using a chosen imputation technique. BPI works with various imputation techniques and can significantly reduce imputation time compared to conducting dimensionality reduction after imputation, which makes it a practical and efficient approach for large datasets with monotone missing data. Our experiments validate the improvement in speed. They also show that, in cases where applying MICE imputation directly to the missing data does not converge, applying BPI with MICE may converge.
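The pipeline described above can be illustrated with a short sketch. This is not the authors' implementation: it assumes NumPy, NaN-coded missing values, a known column partition into blocks, and user-chosen component counts per block. PCA is done via SVD, and a simple column-mean imputer stands in for the "chosen imputation technique" (the paper uses methods such as MICE). The names `pca_scores`, `mean_impute`, and `bpi` are hypothetical.

```python
import numpy as np

def pca_scores(block, q):
    """Project fully observed rows onto their top-q principal components."""
    centered = block - block.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:q].T

def mean_impute(z):
    """Stand-in imputer (assumption): fill NaNs with the column's observed mean."""
    col_means = np.nanmean(z, axis=0)
    return np.where(np.isnan(z), col_means, z)

def bpi(X, block_cols, n_components, imputer=mean_impute):
    """Blockwise PCA on observed rows, stack the scores, then impute."""
    n = X.shape[0]
    parts = []
    for cols, q in zip(block_cols, n_components):
        block = X[:, cols]
        observed = ~np.isnan(block).any(axis=1)  # rows where this block is observed
        z = np.full((n, q), np.nan)              # unobserved rows stay NaN
        z[observed] = pca_scores(block[observed], q)
        parts.append(z)
    z_star = np.hstack(parts)                    # reduced data, still with gaps
    return imputer(z_star)

# Synthetic monotone data: 3 blocks of 4 columns, observed in 60, 40, and 20 rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))
X[40:, 4:8] = np.nan
X[20:, 8:12] = np.nan
blocks = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

Z = bpi(X, blocks, n_components=[2, 2, 2])
print(Z.shape)  # (60, 6)
```

The imputer here sees a 60 × 6 matrix instead of the original 60 × 12 one; running the imputer on the smaller reduced representation is the source of the time savings the abstract describes.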

Paper Structure

This paper contains 9 sections, 24 equations, 3 figures, 2 tables, and 1 algorithm.

Introduction
Related Works
Methodology
Example
Theoretical analysis
Experiments
Experimental setting
Result & Analysis
Conclusion

Figures (3)

Figure 1: Time and accuracy comparison between Baseline and BPI with different imputation methods across various datasets, using an SVM classifier.
Figure 2: Time and accuracy comparison between Baseline and BPI with different imputation methods across various datasets, using a KNN classifier.
Figure 3: Time and accuracy comparison between Baseline and BPI with different imputation methods across various datasets, using a Neural Network classifier.