Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists

David Snee, Luca Ciucci, Arne Rubehn, Kellen Parker van Dam, Johann-Mattis List

TL;DR

The paper tests how robust concept translations are in the multilingual wordlists used for phylogenetic analyses. It compares independently compiled, Lexibank-derived wordlists from 10 dataset pairs covering 9 language families, using an SCA-distance–based framework supplemented by Levenshtein metrics to quantify variation in translations and transcriptions, with manual annotation as a gold standard. On average, only 83% of translations yield the same word form, and phonetically identical transcriptions occur in just 23% of cases, which points to significant uncertainty for downstream phylogenetic inferences (notably across non-Indo-European families). The authors argue for robustness checks, increased inter-annotator validation, and more cross-family data to ensure reliable linguistic phylogenies while preserving the value of computational phylogenetics.

Abstract

Multilingual wordlists play a crucial role in comparative linguistics. While many studies have been carried out to test the power of computational methods for language subgrouping or divergence time estimation, few studies have put the data upon which these studies are based to a rigorous test. Here, we conduct a first experiment that tests the robustness of concept translation as an integral part of the compilation of multilingual wordlists. Investigating the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families, we find that on average, only 83% of all translations yield the same word form, while identical forms in terms of phonetic transcriptions can only be found in 23% of all cases. Our findings can prove important when trying to assess the uncertainty of phylogenetic studies and the conclusions derived from them.

Paper Structure

This paper contains 14 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Location of the languages investigated in our study. For each of the 70 languages, two wordlists were identified in the Lexibank repository.
  • Figure 2: Comparison of all language pairs in the sample.