A PCA-based Data Prediction Method
Peteris Daugulis, Vija Vagale, Emiliano Mancini, Filippo Castiglione
TL;DR
This work introduces the PCA-distance method for imputing missing data: the complete data are represented by a PCA-derived subspace, and the imputed value is the candidate point in a shifted prediction subspace that lies closest to it. For the Euclidean metric, the authors derive closed-form solutions via orthogonal projections for both one-dimensional and multidimensional prediction spaces, and they extend the framework to outlier handling, cross-validation, and confidence interval estimation. The method is implemented with scaling and a detailed algorithm, and demonstrated on a real-world antimicrobial resistance dataset, showing competitive predictive performance relative to existing approaches. The approach emphasizes a geometric, regression-free view of imputation, suitable for high-dimensional, heterogeneous data without strong distributional assumptions, and is framed as a bridge between unsupervised feature learning and semi-supervised prediction.
Abstract
The problem of choosing appropriate values for missing data is often encountered in data science. We describe a novel method for the prediction (imputation) of missing data that combines elements of traditional mathematics and machine learning. The method is based on the notion of distance between shifted linear subspaces representing the existing data and the candidate sets. The existing data set is represented by the subspace spanned by its first principal components. Solutions for the case of the Euclidean metric are given.
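The geometric idea can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: it omits scaling, outlier handling, and confidence intervals, and it assumes that the candidate set for a record is the shifted subspace obtained by fixing the known coordinates and letting the missing ones vary. The function names `pca_subspace` and `impute_by_pca_distance` are hypothetical, and the Euclidean minimization is solved with a generic least-squares call rather than the closed-form projections derived in the paper.

```python
import numpy as np

def pca_subspace(X, k):
    """Return the data mean and an orthonormal basis (columns) of the
    subspace spanned by the first k principal components of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def impute_by_pca_distance(x, missing, mean, V):
    """Fill the missing coordinates of x with the values that minimize the
    Euclidean distance to the shifted subspace mean + span(V).
    `missing` is a boolean mask; entries of x under the mask are ignored."""
    n = x.size
    x0 = np.where(missing, 0.0, x)      # known coordinates, zeros elsewhere
    E = np.eye(n)[:, missing]           # directions of the missing coordinates
    P = np.eye(n) - V @ V.T             # projector onto the orthogonal complement
    # Minimize ||P (x0 + E u - mean)|| over u (assumed formulation).
    u, *_ = np.linalg.lstsq(P @ E, -P @ (x0 - mean), rcond=None)
    filled = x0.copy()
    filled[missing] = u
    return filled

# Toy data lying exactly on a line in R^3 (illustrative, not from the paper).
t = np.linspace(-1.0, 1.0, 9)
X = 1.0 + np.outer(t, [1.0, 2.0, 3.0])
mean, V = pca_subspace(X, k=1)

x = np.array([3.0, 5.0, np.nan])        # third coordinate is missing
filled = impute_by_pca_distance(x, np.isnan(x), mean, V)
print(filled)
```

Because the toy data are noise-free and the record with known coordinates (3, 5) lies on the same line, the minimal distance is zero and the third coordinate is recovered as approximately 7. With noisy data the result depends on the number of retained components `k`, which this sketch leaves to the caller.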
