Informational Rescaling of PCA Maps with Application to Genetic Distance
Nassim Nicholas Taleb, Pierre Zalloua, Khaled Elbassioni, Andreas Henschel, Daniel E. Platt
TL;DR
This paper identifies a mismatch between covariance-based distances and information-theoretic distances in PCA visualizations, particularly for genetic data. It proposes a simple MI-based rescaling of PCA coordinates that preserves order along each component while aligning distances with information, using the Gaussian relationship I_{X,Y} = -1/2 log(1 - ρ^2), where ρ is the correlation between X and Y. A matrix formulation implements the transform with standard linear-algebra steps (centering, SVD, and monotone transforms). The method is demonstrated on global population data, where it reveals clustering and proximity shifts that conventional PCA does not capture. The work suggests that many prior genetic-distance conclusions could be revisited under an information-distance framework, with practical implications for population genetics and genomic interpretation.
Abstract
We discuss the inadequacy of covariances/correlations and other measures in L^2 as relative distance metrics under some conditions. We propose a computationally simple heuristic to transform a map based on standard principal component analysis (PCA) (when the variables are asymptotically Gaussian) into an entropy-based map where distances are based on mutual information (MI). Rescaling principal-component-based distances using MI allows a representation of relative statistical associations when, as in genetics, it is applied to measurements in bits of the mutual information between individuals' genomes. This entropy-rescaled PCA, while preserving order relationships (along a dimension), changes the relative distances to make them linear in information. We show the effect on the entire world population and some subsamples, which leads to significant differences from the results of current research.
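As an illustration of the rescaling idea, the following is a minimal sketch, not the authors' implementation. It assumes the Gaussian relationship I = -1/2 log(1 - ρ^2) (in nats; divide by ln 2 for bits) and applies it as a signed, monotone transform to PCA scores obtained by centering and SVD. The function names `gaussian_mi` and `mi_rescale_pca`, the `shrink` parameter, and the step that normalizes each component into (-1, 1) are hypothetical choices made for this example only.

```python
import numpy as np

def gaussian_mi(rho):
    """Mutual information (nats) of a bivariate Gaussian with correlation rho."""
    rho = np.asarray(rho, dtype=float)
    return -0.5 * np.log1p(-rho**2)

def mi_rescale_pca(X, n_components=2, shrink=0.99):
    """Return (rescaled scores, ordinary PCA scores) for data matrix X (rows = samples)."""
    # Standard PCA: center the columns, then take the SVD.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]

    # Assumption for this sketch: map each component into (-1, 1) so that it can
    # be treated like a correlation. `shrink` keeps values away from +/-1,
    # where the logarithm diverges.
    scale = np.abs(scores).max(axis=0)
    scale[scale == 0] = 1.0
    r = shrink * scores / scale

    # Signed monotone transform: order along each component is preserved,
    # while distances become proportional to information.
    return np.sign(r) * gaussian_mi(r), scores

# Equal steps in correlation are not equal steps in information.
for rho in (0.1, 0.5, 0.9):
    print(rho, float(gaussian_mi(rho)))
```

The loop at the end shows why the rescaling matters: moving from ρ = 0.5 to ρ = 0.9 adds far more information than moving from ρ = 0.1 to ρ = 0.5, although both steps have the same length on a correlation scale.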
