A PCA-based Data Prediction Method

Peteris Daugulis, Vija Vagale, Emiliano Mancini, Filippo Castiglione

TL;DR

This work introduces the PCA-distance method for imputing missing data by representing the complete data with a PCA-derived subspace and selecting a candidate point from a shifted prediction subspace that minimizes their distance. For the Euclidean metric, the authors derive closed-form solutions via orthogonal projections for both one-dimensional and multidimensional prediction spaces, and they extend the framework to outlier handling, cross-validation, and confidence interval estimation. The method is implemented with scaling and a detailed algorithm, and demonstrated on a real-world antimicrobial resistance dataset, showing competitive predictive performance relative to existing approaches. The approach emphasizes a geometric, regression-free view of imputation, suitable for high-dimensional, heterogeneous data without strong distributional assumptions, and is framed as a bridge between unsupervised feature learning and semi-supervised prediction.

Abstract

The problem of choosing appropriate values for missing data is often encountered in data science. We describe a novel method that combines elements of traditional mathematics and machine learning for the prediction (imputation) of missing data. The method is based on the notion of distance between shifted linear subspaces representing the existing data and the candidate sets. The existing data set is represented by the subspace spanned by its first principal components. Solutions for the case of the Euclidean metric are given.
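The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: it omits scaling, outlier handling, and cross-validation, and the function name `pca_distance_impute` and its parameters are hypothetical. It assumes the complete records span an affine PCA subspace (mean plus the first `n_components` principal directions) and fills the missing coordinates of a new record with the values that minimize the Euclidean distance to that subspace, solved as a least-squares problem.

```python
import numpy as np

def pca_distance_impute(X_complete, x_partial, n_components):
    """Fill NaN entries of x_partial so that the completed point is as close
    as possible (Euclidean metric) to the affine PCA subspace of X_complete.
    Hypothetical helper for illustration only."""
    X = np.asarray(X_complete, dtype=float)
    x = np.asarray(x_partial, dtype=float).copy()
    mu = X.mean(axis=0)
    # Principal directions are the right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V = Vt[:n_components].T                  # m x k, orthonormal columns
    W = np.eye(X.shape[1]) - V @ V.T         # projector onto the orthogonal complement
    miss = np.isnan(x)
    c = np.where(miss, 0.0, x - mu)          # centered known part, zeros at the gaps
    # Distance to the subspace is ||W[:, miss] z + W c||; minimize over z.
    z, *_ = np.linalg.lstsq(W[:, miss], -W @ c, rcond=None)
    x[miss] = z + mu[miss]
    return x

# Synthetic check (made-up data): points lying exactly on a 2-D affine plane in R^5.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(50, 2)) @ basis + 3.0
truth = rng.normal(size=2) @ basis + 3.0
query = truth.copy()
query[0] = np.nan                            # hide one coordinate
filled = pca_distance_impute(X, query, n_components=2)
```

Because the synthetic data lie exactly on the plane, the hidden coordinate is recovered up to floating-point error. On real data the minimum distance is generally nonzero, and the result depends on the chosen number of components.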

Paper Structure

This paper contains 24 sections, 6 theorems, 16 equations.

Key Result

Proposition 1.2.1

Let $p_{1},\dots,p_{n}$ be linearly independent elements of $\mathbb{R}^{m}$, and let $P=[p_{1}|\dots|p_{n}]$ be the $m\times n$ matrix obtained by joining $p_{1},\dots,p_{n}$. Denote where $w_{1}$ is the first column of $W$. Let $L=\left\{\begin{bmatrix} t\\ l'\end{bmatrix} \,\middle|\, t\in \mathbb{R}\right\}$, with $l'\in \mathbb{R}^{m-1}$ fixed, be an affine line in $\mathbb{R}^{m}$. Let $\mathcal{P}=\langle p_{1},\dots,p_{n}\rangle\le \mathbb{R}^{m}$.
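The one-dimensional setting of this proposition can be sketched numerically. The definition of $W$ is not preserved in this extract, so the sketch assumes that $W = I - P(P^{T}P)^{-1}P^{T}$, the orthogonal projector onto the complement of $\mathcal{P}$; the paper's own definition may differ. It also assumes that the free coordinate $t$ is the first one. Under these assumptions the squared distance from a point of $L$ to $\mathcal{P}$ is a quadratic in $t$, and its minimizer has a closed form. The function name `closest_t_on_line` is hypothetical.

```python
import numpy as np

def closest_t_on_line(P, l_prime):
    """Return (t, dist): the first coordinate t of the point (t, l') on the
    affine line L that is closest to span(P), and that minimal distance.
    Assumes W is the orthogonal projector onto the complement of span(P)."""
    P = np.asarray(P, dtype=float)
    m = P.shape[0]
    W = np.eye(m) - P @ np.linalg.solve(P.T @ P, P.T)
    w1 = W[:, 0]                                        # first column of W
    b = W[:, 1:] @ np.asarray(l_prime, dtype=float)     # contribution of the fixed part l'
    # dist(t)^2 = ||t * w1 + b||^2, minimized at t = -<w1, b> / <w1, w1>.
    # Requires w1 != 0, i.e. the first basis vector must not lie in span(P).
    t = -(w1 @ b) / (w1 @ w1)
    dist = np.linalg.norm(t * w1 + b)
    return t, dist

# Example: span(P) = {(a, b, a + b)} in R^3, line L = {(t, 2, 5)}.
P = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
t, dist = closest_t_on_line(P, [2.0, 5.0])
```

In this example the line meets the plane at $t = 3$ (since $3 + 2 = 5$), so the minimal distance is zero up to rounding.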

Theorems & Definitions (18)

  • Proposition 1.2.1
  • proof
  • Remark 1.2.2
  • Remark 1.2.3
  • Remark 1.2.4
  • Proposition 1.2.5
  • proof
  • Proposition 1.2.6
  • proof
  • Remark 1.2.7
  • ...and 8 more