Computational Approaches for Integrating out Subjectivity in Cognate Synonym Selection

Luise Häuser, Gerhard Jäger, Alexandros Stamatakis

TL;DR

Subjective choices in cognate data, especially which synonyms to keep, can bias language phylogeny inference. This work evaluates maximum likelihood tree inferences with RAxML-NG on three character matrix representations (binary, probabilistic binary, and probabilistic multi-valued), compares the resulting trees to a Glottolog gold standard, and assesses the impact of selecting synonyms a priori. Manual synonym selection can substantially alter tree topologies, whereas including all synonyms generally yields plausible trees. No single matrix type consistently outperforms the others: which one comes closest to the gold standard depends on the dataset, so it is worth evaluating all three. The paper also provides a Python interface that generates these matrix types from CLDF data, so that researchers can include all synonym information instead of choosing among synonyms by hand.

Abstract

Working with cognate data involves handling synonyms, that is, multiple words that describe the same concept in a language. In the early days of language phylogenetics it was recommended to select one synonym only. However, as we show here, binary character matrices, which are used as input for computational methods, do allow for representing the entire dataset including all synonyms. Here we address the question of how one can include all synonyms, and whether one should do so or instead select synonyms a priori. To this end, we perform maximum likelihood tree inferences with the widely used RAxML-NG tool and show that it yields plausible trees when all synonyms are used as input. Furthermore, we show that a priori synonym selection can yield topologically substantially different trees, and we therefore advise against it. To represent cognate data including all synonyms, we introduce two types of character matrices beyond the standard binary ones: probabilistic binary and probabilistic multi-valued character matrices. We further show that which character matrix type yields the RAxML-NG tree that is topologically closest to the gold standard depends on the dataset. We also make available a Python interface for generating all of the above character matrix types for cognate data provided in CLDF format.
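
The claim that a binary character matrix can represent all synonyms can be illustrated with a minimal sketch. It is not the Python interface the paper provides: the toy data, the function name `binary_character_matrix`, and the use of `-` for missing data are assumptions made for illustration. Each column stands for one (concept, cognate class) pair, so a language with two synonyms for a concept simply receives a 1 in two columns.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: (language, concept) -> set of cognate class ids.
# A cell with more than one class means the language has synonyms.
cognates: Dict[Tuple[str, str], Set[int]] = {
    ("English", "big"): {1, 2},   # two synonyms in different cognate classes
    ("German", "big"): {2},
    ("Dutch", "big"): {2},
    ("English", "hand"): {1},
    ("German", "hand"): {1},
    # Dutch has no entry for "hand": treated as missing data
}


def binary_character_matrix(
    cells: Dict[Tuple[str, str], Set[int]],
) -> Tuple[List[Tuple[str, int]], Dict[str, str]]:
    """Build one binary column per (concept, cognate class) pair."""
    languages = sorted({lang for lang, _ in cells})
    columns = sorted(
        {(concept, cls) for (_, concept), classes in cells.items() for cls in classes}
    )
    matrix: Dict[str, str] = {}
    for lang in languages:
        row = []
        for concept, cls in columns:
            classes = cells.get((lang, concept))
            if classes is None:
                row.append("-")  # no word recorded for this concept
            else:
                row.append("1" if cls in classes else "0")
        matrix[lang] = "".join(row)
    return columns, matrix


columns, matrix = binary_character_matrix(cognates)
for lang, row in matrix.items():
    print(f"{lang:8s} {row}")
```

In this toy example the columns are ordered as (big, 1), (big, 2), (hand, 1), and the English row carries a 1 in both `big` columns, so no synonym has to be discarded.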

Paper Structure

This paper contains 17 sections, 4 equations, 12 figures, and 1 table.

Figures (12)

  • Figure 1: (a): Native cognate data (b): Corresponding matrix $M$ with cognate classes
  • Figure 2: (a): Binary character matrix $A^{\mathrm{b}}$ (b): Multi-valued character matrix $A^{\mathrm{m}}$, both corresponding to the cognate dataset in Figure 1. Note that $A^{\mathrm{m}}$ is invalid, because $M(\text{English}, \text{big})$ is a multi-state cell.
  • Figure 3: Tree with conditional likelihood tip vectors for the per-column likelihood of the column big_2. The light gray vectors refer to the inference on $A^{\mathrm{b}}$ (Figure 2), the dark gray ones to the inference on $A^{\mathrm{b*}}$ (Figure 4).
  • Figure 4: (a): Probabilistic binary character matrix $A^{\mathrm{b}}$ (b): Probabilistic multi-valued character matrix $A^{\mathrm{m}}$, both corresponding to the cognate dataset in Figure 1.
  • Figure 5: Experimental setup for assessing the effects of synonym selection: For each dataset, we create $100$ selection samples and construct the corresponding (deterministic) binary character matrices $A^{\mathrm{b}}_i$, $i \in \{1, \dots, 100\}$. $A^{\mathrm{b}}_{\mathrm{full}}$ represents the complete dataset including all synonyms. For each character matrix we consider the best-scoring tree resulting from 20 independent tree searches with RAxML-NG, denoted by $T_i$, $i \in \{1, \dots, 100\}$, and $T_{\mathrm{full}}$, respectively. $\delta_i$ corresponds to the RF distance between $T_i$ and $T_{\mathrm{full}}$, and $\rho_i$ to the GQ distance between $T_i$ and the gold standard tree. By $\rho_{\mathrm{full}}$ we denote the GQ distance between $T_{\mathrm{full}}$ and the gold standard.
  • ...and 7 more figures
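
The sampling step described in the Figure 5 caption can be sketched as follows. The caption does not say how a selection sample is drawn, so picking one cognate class uniformly at random per cell is an assumption here, as are the toy data, the function names, and the fixed seed. The tree searches with RAxML-NG and the RF and GQ distance computations are not shown.

```python
import random
from typing import Dict, List, Set, Tuple

Cells = Dict[Tuple[str, str], Set[int]]

# Hypothetical toy data: (language, concept) -> set of cognate class ids.
cognates: Cells = {
    ("English", "big"): {1, 2},   # synonyms: two candidate classes
    ("German", "big"): {2},
    ("English", "hand"): {1},
    ("German", "hand"): {1, 3},   # synonyms: two candidate classes
}


def sample_selection(cells: Cells, rng: random.Random) -> Cells:
    """Keep exactly one cognate class per cell.

    Assumption: the class is drawn uniformly at random; classes are sorted
    first so that a fixed seed gives reproducible samples.
    """
    return {key: {rng.choice(sorted(classes))} for key, classes in cells.items()}


def selection_samples(cells: Cells, n: int = 100, seed: int = 0) -> List[Cells]:
    """Draw n independent selection samples from the full dataset."""
    rng = random.Random(seed)
    return [sample_selection(cells, rng) for _ in range(n)]


samples = selection_samples(cognates, n=100)
print(len(samples), "selection samples drawn")
```

Each sample would then be converted into its own binary character matrix $A^{\mathrm{b}}_i$ and passed to the tree search, while the unsampled `cognates` dictionary corresponds to the full dataset.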