Informational Rescaling of PCA Maps with Application to Genetic Distance

Nassim Nicholas Taleb; Pierre Zalloua; Khaled Elbassioni; Andreas Henschel; Daniel E. Platt

TL;DR

This paper identifies a fundamental mismatch between covariance-based distance metrics and information-theoretic distances in PCA visualizations, particularly for genetic data. It proposes a simple, MI-based rescaling of PCA coordinates that preserves order along each component while making distances proportional to information, using the Gaussian relationship $I_{X,Y} = -\frac{1}{2}\log(1-\rho^2)$ between mutual information and the correlation $\rho$. A matrix formulation implements the transform with standard linear-algebra steps (centering, SVD, and monotone transforms). The method is demonstrated on global population data, where it reveals clustering and proximity shifts that conventional PCA does not capture. The work suggests that many prior genetic-distance conclusions could be revisited under an information-distance framework, with practical implications for population genetics and genomic interpretation.
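The Gaussian relationship quoted above can be checked numerically. The following is a minimal sketch, assuming a bivariate Gaussian and natural logarithms (so the result is in nats; dividing by ln 2 gives bits). The function name `gaussian_mi` is illustrative and not taken from the paper.

```python
import math


def gaussian_mi(rho: float) -> float:
    """Mutual information (in nats) of a bivariate Gaussian with correlation rho.

    Implements I = -1/2 * log(1 - rho^2); only valid for |rho| < 1.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log(1.0 - rho ** 2)


if __name__ == "__main__":
    # Equal steps in correlation do not correspond to equal steps in information.
    for rho in (0.1, 0.5, 0.9, 0.99):
        print(f"rho={rho:<5} MI={gaussian_mi(rho):.4f} nats")
```

Because the map from correlation to information is strongly nonlinear, the step from 0.5 to 0.9 in correlation adds far more information than the step from 0.1 to 0.5, which is the mismatch the rescaling is meant to address.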

Abstract

We discuss the inadequacy of covariances/correlations and other measures in L2 as relative distance metrics under some conditions. We propose a computationally simple heuristic to transform a map based on standard principal component analysis (PCA), when the variables are asymptotically Gaussian, into an entropy-based map where distances are based on mutual information (MI). Rescaling principal-component-based distances using MI allows a representation of relative statistical associations when, as in genetics, it is applied to measurements, in bits, of the mutual information between individuals' genomes. This entropy-rescaled PCA preserves order relationships along each dimension but changes the relative distances so that they are linear in information. We show the effect on the entire world population and some subsamples, which leads to significant differences from the results of current research.
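To make the idea of an order-preserving rescaling concrete, here is a rough sketch in Python with NumPy. It is not the paper's matrix formulation, which this page does not reproduce. It assumes, purely for illustration, that each component's scores are first scaled into (-1, 1) and then passed through the odd, monotone map r -> sign(r) * (-1/2) log(1 - r^2). The names `signed_mi` and `rescaled_pca` and the `shrink` parameter are hypothetical.

```python
import numpy as np


def signed_mi(r):
    """Odd, monotone map r -> sign(r) * (-1/2) * log(1 - r^2), for |r| < 1."""
    r = np.asarray(r, dtype=float)
    return np.sign(r) * (-0.5) * np.log1p(-r ** 2)


def rescaled_pca(X, n_components=2, shrink=0.99):
    """Illustrative sketch: conventional PCA scores and an MI-style rescaling.

    Assumption (not from the paper): scores on each component are divided by
    their largest absolute value and multiplied by `shrink` (< 1) so that they
    lie strictly inside (-1, 1) before the signed MI map is applied.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                           # center each column
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)  # thin SVD of centered data
    scores = U[:, :n_components] * s[:n_components]   # conventional PCA coordinates
    scale = np.abs(scores).max(axis=0)                # per-component normalizer
    scale[scale == 0] = 1.0                           # guard against empty components
    r = shrink * scores / scale                       # values in [-shrink, shrink]
    return scores, signed_mi(r)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))                      # synthetic data, not genetic data
    scores, rescaled = rescaled_pca(X)
    # The ordering of points along each component is unchanged by the rescaling.
    print(np.array_equal(np.argsort(scores[:, 0]), np.argsort(rescaled[:, 0])))
```

Since the map is strictly increasing, the ranking of points along each component is preserved while the spacing between them changes, which matches the behaviour the abstract describes. The choice of normalizer here is an arbitrary stand-in for whatever scaling the authors use.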

Paper Structure

This paper contains 8 sections, 17 equations, 4 figures, and 1 table.

Introduction: The problem of correlation
Information and correlation
PCA under Mutual Information
Mutual Information
Re-scaling PCA distances using Mutual Information
In Matrix Notation
Discussion and Application to Genetic Distance
Supplementary Material

Figures (4)

Figure 1: Transformation of PCA maps to accommodate informational distances
Figure 2: The visual intuition for the three possible methods for informational distances. We generate bivariate normal distributions for $X$ and $Y$, and represent the iso-densities on the $X$ and $Y$ axes. Each square is equidistant with respect to the parameters correlation, correlation squared, and MI to the one to its left and its right, above and below it, as well as on the diagonal. MI matches our visual intuition.
Figure 3: Conventional principal component analysis (PCA) for 5 populations: Buryat, Spanish, Sri Lankan Tamil in the UK (STU), Colombian in Medellín, Colombia (CLM) and Gujarati Indians in Houston, Texas, USA (GIH). While the gap between CLM and GIH appears rather large in conventional PCA, comparable to the distance between CLM and Buryat, rescaling places CLM substantially closer to GIH, as shown in panel b).
Figure 4: A different world view: the commonly observed triangular PCA shape of world populations undergoes proximity rearrangements using information based rescaling. Non-African and non-Asian populations are much closer together in b).
