MML Probabilistic Principal Component Analysis
Enes Makalic, Daniel F. Schmidt
TL;DR
The paper addresses automatic selection of the number of principal components and improved estimation of the residual variance in probabilistic PCA. The data are modeled as ${\bf x}_i = {\bf A}{\bf v}_i + \bm{\epsilon}_i$ with $\bm{\epsilon}_i \sim N({\bf 0}, \sigma^2{\bf I}_K)$. The authors introduce a Bayesian minimum message length (MML) approach, deriving a tractable codelength via the MML87 approximation and a polynomial-based solution for the residual variance, and connect factor-detection thresholds to BBP phase transitions. Empirical results show that the MML residual-variance estimator is less biased than the maximum likelihood (ML) estimator and that MML-based model selection outperforms BIC and tracks Bayes performance; code is available for reproduction. Overall, the approach provides automatic component selection and improved parameter estimation for probabilistic PCA, and offers a pathway to extending MML PCA to finite mixtures and related models.
Abstract
Principal component analysis (PCA) is perhaps the most widely used method for data dimensionality reduction. A key question in PCA is deciding how many factors to retain. This manuscript describes a new approach to automatically selecting the number of principal components based on the Bayesian minimum message length method of inductive inference. We derive a new estimate of the isotropic residual variance and demonstrate that it improves on the usual maximum likelihood approach. We also discuss extending this approach to finite mixture models of principal component analyzers.
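To make the baseline concrete, the following is a minimal sketch of the probabilistic PCA model ${\bf x}_i = {\bf A}{\bf v}_i + \bm{\epsilon}_i$ and of the usual maximum likelihood estimate of the isotropic residual variance, which is the average of the smallest eigenvalues of the sample covariance matrix. It does not implement the MML estimator described in the paper. The function names, the standard normal latent factors ${\bf v}_i$, the randomly drawn loading matrix, and the zero-mean sample covariance are assumptions made for illustration only.

```python
import numpy as np


def simulate_ppca(n, K, J, sigma2, rng):
    """Draw n samples from x_i = A v_i + eps_i.

    Assumptions (not from the paper): A has i.i.d. standard normal
    entries and the latent factors v_i are standard normal.
    """
    A = rng.normal(size=(K, J))                          # K x J loading matrix
    V = rng.normal(size=(n, J))                          # latent factors, one row per sample
    E = rng.normal(scale=np.sqrt(sigma2), size=(n, K))   # isotropic residual noise
    return V @ A.T + E


def ml_residual_variance(X, J):
    """Usual ML estimate of sigma^2 when J components are retained:
    the mean of the K - J smallest eigenvalues of the sample covariance.
    The data are treated as zero-mean, matching the model as written.
    """
    n, K = X.shape
    S = X.T @ X / n                    # sample covariance (no centering)
    eigvals = np.linalg.eigvalsh(S)    # eigenvalues in ascending order
    return eigvals[: K - J].mean()


rng = np.random.default_rng(0)
X = simulate_ppca(n=50, K=10, J=3, sigma2=1.0, rng=rng)
sigma2_ml = ml_residual_variance(X, J=3)
print(f"ML residual variance estimate: {sigma2_ml:.3f}")
```

Repeating the simulation over many small samples and averaging `sigma2_ml` is one way to examine the bias of the ML estimate that the MML estimator is reported to reduce.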
