Over-representation of phonological features in basic vocabulary doesn't replicate when controlling for spatial and phylogenetic effects

Frederic Blum

TL;DR

The paper challenges prior claims that basic vocabulary universally over-represents certain phonological features by reanalyzing with a much larger, Lexibank-derived sample and by explicitly modeling phylogenetic and areal dependencies. Using a Bayesian Dirichlet framework with dual multilevel intercepts for genealogy and geography, the study finds that most previously reported patterns do not hold when dependencies are controlled, with only a small subset (notably some pronouns and body terms) remaining robust. The work demonstrates the importance of robustness analyses and large-scale, bias-controlled sampling for typological generalizations, and it provides open data and code to facilitate replication. Overall, the results temper claims of widespread sound symbolism in basic vocabulary and highlight the nuanced, context-dependent nature of such patterns across languages.

Abstract

The statistical over-representation of phonological features in the basic vocabulary of languages is often interpreted as reflecting potentially universal sound symbolic patterns. However, most of those results have not been tested explicitly for reproducibility and might be prone to biases in the study samples or models. Many studies on the topic do not adequately control for genealogical and areal dependencies between sampled languages, casting doubt on the robustness of the results. In this study, we test the robustness of a recent study on sound symbolism of basic vocabulary concepts which analyzed 245 languages. The new sample includes data on 2864 languages from Lexibank. We modify the original model by adding statistical controls for spatial and phylogenetic dependencies between languages. The new results show that most of the previously observed patterns are not robust, and in fact many patterns disappear completely when adding the genealogical and areal controls. A small number of patterns, however, emerge as highly stable even with the new sample. Through the new analysis, we are able to assess the distribution of sound symbolism on a larger scale than was previously possible. The study further highlights the need for testing all universal claims on language for robustness on various levels.
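To make the dependency problem concrete, here is a minimal toy sketch in Python. It is not the paper's model, which uses Bayesian multilevel controls; it only contrasts a naive per-language rate with a rate that first averages within families. The data, the family names, and both function names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

def naive_rate(rows):
    """Share of languages showing the pattern, treating each language as independent."""
    return sum(value for _, _, value in rows) / len(rows)

def family_balanced_rate(rows):
    """Average the pattern within each family first, then across families."""
    by_family = defaultdict(list)
    for _, family, value in rows:
        by_family[family].append(value)
    family_means = [sum(values) / len(values) for values in by_family.values()]
    return sum(family_means) / len(family_means)

# Hypothetical sample: one large family whose 8 members all inherited the
# pattern (1), plus 4 unrelated isolates that lack it (0).
rows = [(f"a{i}", "FamilyA", 1) for i in range(8)]
rows += [(f"iso{i}", f"Isolate{i}", 0) for i in range(4)]

print(round(naive_rate(rows), 3))            # 8 of 12 languages
print(round(family_balanced_rate(rows), 3))  # 1 of 5 families
```

In this toy sample the naive rate is 8/12, while the family-balanced rate is 1/5: a single shared inheritance event counted eight times looks like a cross-linguistic tendency. Averaging by family is a crude stand-in for the genealogical control described above, and areal dependencies would need an analogous treatment.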

Paper Structure

This paper contains 17 sections, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Prior distributions for all model parameters.
  • Figure 2: Correlation scatter plot between old and new models for all concepts and their phonological features. The mean values for all phonological features across all concepts are directly compared with each other. The dashed line indicates a perfect correlation, whereas the red line represents a fitted linear model with three cubic B-splines.
  • Figure 3: Comparison of old and new data mean effect sizes for all phonological categories. Grey circles indicate that the value is from the original results, while the colored triangles are the new results. The x-axis represents the individual phonological feature, grouped by category. The grouping is indicated by coloring. The shaded area around zero indicates the ROPE. The five highest values of the new results are labeled.
  • Figure 4: Comparison of results for 'backness'. Red coloring indicates strong, and blue coloring weak results. Doubtful results are greyed out. The circle indicates the mean value, whereas the line shows the 95% interval of the posterior distribution.
  • Figure 5: Comparison of results for 'roundedness'.
  • ...and 5 more figures
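
The captions above refer to a ROPE (region of practical equivalence) around zero and to 95% posterior intervals. As a rough illustration of how such a comparison is commonly made, the Python sketch below classifies a set of posterior draws against a ROPE. The ROPE bounds of ±0.1, the use of an equal-tailed interval rather than a highest-density interval, the three labels, and the function names are all assumptions made for this example; the paper's own thresholds and categories may differ.

```python
import statistics

def central_interval(samples):
    """Equal-tailed 95% interval from posterior draws."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%; keep the outer two.
    cuts = statistics.quantiles(samples, n=40)
    return cuts[0], cuts[-1]

def rope_decision(samples, rope=(-0.1, 0.1)):
    """Compare the 95% interval with the ROPE (bounds are an assumption)."""
    low, high = central_interval(samples)
    if low > rope[1] or high < rope[0]:
        return "outside ROPE"   # interval excludes the ROPE entirely
    if rope[0] <= low and high <= rope[1]:
        return "inside ROPE"    # interval lies fully within the ROPE
    return "undecided"          # interval overlaps a ROPE boundary

# Hypothetical posterior draws on an evenly spaced grid, for illustration only.
clear_effect = [0.5 + 0.001 * i for i in range(100)]
null_effect = [-0.05 + 0.001 * i for i in range(100)]
overlapping = [0.005 * i for i in range(100)]

print(rope_decision(clear_effect))
print(rope_decision(null_effect))
print(rope_decision(overlapping))
```

With real model output, the draws would come from the fitted posterior for one concept and one phonological feature rather than from an evenly spaced grid.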