Principal Component Analysis as a Sanity Check for Bayesian Phylolinguistic Reconstruction

Yugo Murawaki

TL;DR

This paper addresses the challenge of validating the tree-model assumption in Bayesian phylolinguistics by introducing a PCA-based sanity check that projects BEAST-sampled trees into a PCA space learned from observed binary lexical features, enabling visualization of anomalies such as jogging. The method projects posterior node states onto the first two principal components ($PC_1$, $PC_2$) to reveal departures from unidirectional, tree-like diversification. Using synthetic datasets with varied borrowing regimes and real data from Japonic, Sino-Tibetan, and Northeast Asian archaeological sites, the paper shows that jogging signals non-tree-like evolution and horizontal transfer, while the absence of jogging does not guarantee model validity. The approach is simple, parameter-light, and broadly applicable to published studies, and code is available to facilitate replication and adoption in phylolinguistic analyses.

Abstract

Bayesian approaches to reconstructing the evolutionary history of languages rely on the tree model, which assumes that these languages descended from a common ancestor and underwent modifications over time. However, this assumption can be violated to different extents due to contact and other factors. Understanding the degree to which this assumption is violated is crucial for validating the accuracy of phylolinguistic inference. In this paper, we propose a simple sanity check: projecting a reconstructed tree onto a space generated by principal component analysis. By using both synthetic and real data, we demonstrate that our method effectively visualizes anomalies, particularly in the form of jogging.
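
The projection step described above can be illustrated with a minimal sketch. It is not the paper's released code: the toy binary states, the node names, the choice to fit PCA on the observed leaves only, and the helper `reverses_direction` (a crude stand-in for spotting a path that doubles back along one PC) are all assumptions made for illustration.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Fit PCA via SVD on mean-centered rows.

    Returns the feature mean, the top components, and their
    explained-variance ratios.
    """
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    var = s ** 2
    return mean, vt[:n_components], (var / var.sum())[:n_components]

def project(X, mean, components):
    """Project rows of X into the fitted PCA space."""
    return (X - mean) @ components.T

def reverses_direction(values, tol=1e-9):
    """True if consecutive non-zero steps along a path change sign.

    Hypothetical heuristic, not a definition taken from the paper.
    """
    steps = np.diff(values)
    steps = steps[np.abs(steps) > tol]
    return bool(np.any(np.sign(steps[:-1]) != np.sign(steps[1:])))

# Toy binary lexical features (hypothetical): 4 observed leaves.
leaves = np.array([
    [1, 1, 0, 0, 1, 0],  # LangA
    [1, 1, 0, 0, 0, 1],  # LangB
    [0, 0, 1, 1, 1, 0],  # LangC
    [0, 0, 1, 1, 0, 1],  # LangD
], dtype=float)

# Reconstructed internal-node states (hypothetical): root and two ancestors.
internal = np.array([
    [1, 0, 1, 0, 1, 1],  # root
    [1, 1, 0, 0, 1, 1],  # ancestor of LangA and LangB
    [0, 0, 1, 1, 1, 1],  # ancestor of LangC and LangD
], dtype=float)

# Assumption: learn the PCA space from observed leaves, then project all nodes.
mean, comps, ratio = fit_pca(leaves)
all_nodes = np.vstack([leaves, internal])
coords = project(all_nodes, mean, comps)

# Root-to-LangC path: root (row 4) -> ancestor (row 6) -> LangC (row 2).
path_pc1 = coords[[4, 6, 2], 0]
print("explained variance ratios:", ratio)
print("path doubles back along PC1:", reverses_direction(path_pc1))
```

In practice the plotted coordinates would be inspected visually, as in the figures listed below; the sign-change check is only a rough numerical proxy for what the eye picks up along a root-to-leaf path.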

Paper Structure (20 sections, 3 equations, 11 figures)

Figures (11)

  • Figure 1: Overview of the proposed method. In this example, we reconstruct a phylogenetic tree for four modern languages, resulting in three ancestral nodes with explicitly represented states. The states of these seven languages are then subjected to principal component analysis (PCA), followed by projection onto a low-dimensional space. The downward path from the root to LangC exhibits jogging.
  • Figure 2: PCA of Bayesian phylolinguistic reconstruction for the skewed time-tree of data simulation, with four borrowing scenarios. We used the first two PCs, denoted as PC1 and PC2. A percentage indicates the amount of variance explained by the corresponding PC. Circles indicate observed leaf nodes while rectangles denote reconstructed internal nodes.
  • Figure 3: PCA for the balanced time-tree of data simulation, with four borrowing scenarios.
  • Figure 4: PCA for a Japonic sample tree. Left: The entire tree. Right: Zoomed-in view of the mainland portion. Kagoshima (underlined) is the closest modern mainland dialect to Old Japanese along PC1.
  • Figure 5: PCA for a Sino-Tibetan sample tree. Left: The entire tree. Right: Zoomed-in view of the central portion.
  • ...and 6 more figures