Principal Component Analysis as a Sanity Check for Bayesian Phylolinguistic Reconstruction
Yugo Murawaki
TL;DR
This paper addresses the challenge of validating the tree-model assumption in Bayesian phylolinguistics. It introduces a PCA-based sanity check that projects BEAST-sampled trees into a PCA space learned from observed binary lexical features, making anomalies such as jogging visible. The method projects posterior node states onto the first two principal components ($PC_1$, $PC_2$) to reveal departures from unidirectional, tree-like diversification. Using synthetic datasets with varied borrowing regimes and real data from Japonic, Sino-Tibetan, and Northeast Asian archaeological sites, the author shows that jogging signals non-tree-like evolution and horizontal transfer, while the absence of jogging does not guarantee that the model is valid. The approach is simple, parameter-light, and broadly applicable to published studies, and code is available to facilitate replication and adoption in phylolinguistic analyses.
Abstract
Bayesian approaches to reconstructing the evolutionary history of languages rely on the tree model, which assumes that these languages descended from a common ancestor and underwent modifications over time. However, this assumption can be violated to different extents due to contact and other factors. Understanding the degree to which this assumption is violated is crucial for validating the accuracy of phylolinguistic inference. In this paper, we propose a simple sanity check: projecting a reconstructed tree onto a space generated by principal component analysis. By using both synthetic and real data, we demonstrate that our method effectively visualizes anomalies, particularly in the form of jogging.
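The projection step can be illustrated with a minimal sketch. It is not the released implementation: the toy feature matrix, the internal-node probabilities, and the helper names (`top_components`, `project`, `count_reversals`) are all assumptions made for illustration. PCA is fitted on the observed (leaf) binary features only, and reconstructed node states, given here as per-feature presence probabilities, are then projected into the same space. The reversal count is only a crude, hypothetical proxy for jogging along one axis, not the paper's definition.

```python
import math
import random

def center(rows):
    """Return the column means and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return mean, [[r[j] - mean[j] for j in range(d)] for r in rows]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_components(centered, k=2, iters=500, seed=0):
    """Top-k principal axes via power iteration on the sample covariance.
    Each new axis is kept orthogonal to the earlier ones (Gram-Schmidt)."""
    rng = random.Random(seed)
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    comps = []
    for _ in range(k):
        v = normalize([rng.gauss(0, 1) for _ in range(d)])
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            for c in comps:
                dot = sum(wi * ci for wi, ci in zip(w, c))
                w = [wi - dot * ci for wi, ci in zip(w, c)]
            if math.sqrt(sum(x * x for x in w)) < 1e-12:
                break  # no variance left in this direction
            v = normalize(w)
        comps.append(v)
    return comps

def project(state, mean, comps):
    """Coordinates of one node state in the PCA space learned from the leaves."""
    return tuple(sum((s - m) * c for s, m, c in zip(state, mean, comp))
                 for comp in comps)

def count_reversals(coords, axis=0):
    """Hypothetical jogging proxy: direction changes along one PC axis."""
    steps = [b[axis] - a[axis] for a, b in zip(coords, coords[1:])]
    steps = [s for s in steps if abs(s) > 1e-12]
    return sum(1 for s, t in zip(steps, steps[1:]) if s * t < 0)

# Toy observed data (assumed): 4 languages x 6 binary lexical features.
leaves = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [1, 1, 0, 0, 0, 0],
    "C": [0, 0, 0, 1, 1, 1],
    "D": [0, 0, 1, 1, 1, 0],
}
# Toy reconstructed states (assumed): posterior probability of each feature.
internal = {
    "root": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "AB": [1.0, 1.0, 0.5, 0.0, 0.0, 0.0],
    "CD": [0.0, 0.0, 0.5, 1.0, 1.0, 0.5],
}

mean, centered = center(list(leaves.values()))
comps = top_components(centered, k=2)

coords = {name: project(state, mean, comps)
          for name, state in {**leaves, **internal}.items()}

# Root-to-leaf path for language A, inspected along PC1.
path_to_A = [coords["root"], coords["AB"], coords["A"]]
reversals_on_pc1 = count_reversals(path_to_A, axis=0)
```

Plotting `coords` with tree edges drawn between parent and child nodes gives the kind of picture the sanity check relies on. In practice one would fit PCA with an established library and take node states from the sampler's logged output, neither of which is specified here.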
