MML Probabilistic Principal Component Analysis

Enes Makalic; Daniel F. Schmidt

TL;DR

The paper tackles automatic selection of the number of principal components and improved residual-variance estimation in probabilistic PCA. The data are modeled as $${\bf x}_i = {\bf A}{\bf v}_i + \bm{\epsilon}_i,$$ with $\bm{\epsilon}_i \sim N({\bf 0}, \sigma^2{\bf I}_K)$. It introduces a Bayesian minimum message length (MML) approach, deriving a tractable codelength via the MML87 approximation and a polynomial-based solution for the residual variance, while connecting factor-detection thresholds to BBP phase transitions. Empirical results show the MML residual-variance estimator is less biased than ML and that MML-based model selection outperforms BIC and tracks Bayes performance, with code available for reproduction. Overall, the approach provides automatic component selection and improved parameter estimation for probabilistic PCA and offers a pathway to extending MML PCA to finite mixtures and related models.

Abstract

Principal component analysis (PCA) is perhaps the most widely used method for data dimensionality reduction. A key question in PCA is deciding how many factors to retain. This manuscript describes a new approach to automatically selecting the number of principal components based on the Bayesian minimum message length method of inductive inference. We derive a new estimate of the isotropic residual variance and demonstrate that it improves on the usual maximum likelihood approach. We also discuss extending this approach to finite mixture models of principal component analyzers.
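As a rough illustration of the maximum likelihood baseline mentioned above, the sketch below simulates data from the model ${\bf x}_i = {\bf A}{\bf v}_i + \bm{\epsilon}_i$ and computes the standard probabilistic PCA maximum likelihood estimate of the residual variance, namely the average of the smallest $K - J$ eigenvalues of the sample covariance. The function names, the standard normal latent factors ${\bf v}_i$, the random loading matrix, and the sample sizes are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def simulate_ppca(n, K, J, sigma2, rng):
    """Draw n samples from x_i = A v_i + eps_i, eps_i ~ N(0, sigma2 * I_K)."""
    A = rng.normal(size=(K, J))   # hypothetical loading matrix (assumption)
    V = rng.normal(size=(n, J))   # latent factors, assumed v_i ~ N(0, I_J)
    E = rng.normal(scale=np.sqrt(sigma2), size=(n, K))
    return V @ A.T + E

def ml_residual_variance(X, J):
    """ML estimate: mean of the K - J smallest sample-covariance eigenvalues."""
    n, K = X.shape
    S = X.T @ X / n                   # zero-mean model, so no centering
    eigvals = np.linalg.eigvalsh(S)   # eigenvalues in ascending order
    return eigvals[: K - J].mean()

rng = np.random.default_rng(0)
true_sigma2 = 1.0
estimates = [
    ml_residual_variance(simulate_ppca(50, 10, 3, true_sigma2, rng), J=3)
    for _ in range(200)
]
mean_estimate = float(np.mean(estimates))
print(f"true sigma^2 = {true_sigma2}, mean ML estimate = {mean_estimate:.3f}")
```

Averaged over repeated draws, the maximum likelihood estimate tends to fall below the true residual variance in small samples, because the retained components absorb part of the noise. This is the kind of bias a corrected estimator aims to reduce.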

Paper Structure (12 sections, 4 theorems, 121 equations, 3 tables)

Introduction
Maximum likelihood estimation
Minimum message length analysis of the PCA model
Orthogonality constraints
Fisher information
Prior information
Codelength
Experiments
Parameter estimation
Model selection
Discussion
Data Availability

Key Result

Theorem 1

Let $\tau = \sigma^2$. The concentrated codelength (eqn:msglen:conc) has $(J+1)$ stationary points, which are the roots of the gradient polynomial of degree $n = J+1$. The coefficients of this polynomial are expressed in terms of $\hat{\tau}_{\rm ML}$, the maximum likelihood estimate of the residual variance, and $e_t$, the elementary symmetric polynomials $e_t(\delta_1,\ldots,\delta_J)$ in the $J$ variables $(\delta_1, \ldots, \delta_J)$.
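The two numerical ingredients named in Theorem 1, elementary symmetric polynomials and polynomial root finding, can be sketched as follows. This is a minimal illustration only: the helper name `elementary_symmetric`, the values of $\delta_j$, and the stand-in quadratic are assumptions, and the theorem's actual gradient-polynomial coefficients are not given in this text, so they are not used here.

```python
import numpy as np

def elementary_symmetric(deltas):
    """Return [e_0, e_1, ..., e_J] for the inputs (delta_1, ..., delta_J)."""
    # np.poly returns the coefficients of prod_j (x - delta_j), highest
    # degree first; the coefficient of x^(J - t) equals (-1)^t * e_t.
    coeffs = np.poly(deltas)
    signs = (-1.0) ** np.arange(len(coeffs))
    return signs * coeffs

# Placeholder values for (delta_1, delta_2, delta_3); not from the paper.
deltas = [1.0, 2.0, 3.0]
e = elementary_symmetric(deltas)

# Stationary points would be the roots of the gradient polynomial. A
# stand-in polynomial x^2 - 3x + 2 = (x - 1)(x - 2) is used here instead.
roots = np.sort(np.roots([1.0, -3.0, 2.0]))

print(e)
print(roots)
```

In practice, only real, positive roots would be admissible values for a variance $\tau$, so candidate roots would need to be filtered and then compared by their codelength.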

Theorems & Definitions (8)

Theorem 1
proof
Theorem 2
proof
Theorem 3
proof
Theorem 4
proof
