From Isolates to Families: Using Neural Networks for Automated Language Affiliation

Frederic Blum, Steffen Herbold, Johann-Mattis List

TL;DR

The paper demonstrates that neural networks trained on large cross-linguistic wordlists and grammatical features can classify languages into families, with lexical data outperforming grammar alone and combined data delivering the best performance. It validates the approach on deep genealogical cases (Indo-European, Sino-Tibetan, Uto-Aztecan) and explores affiliation of isolates and historical unaffiliated data, showing both strengths and limitations. The work provides a scalable, data-driven tool to inform traditional historical linguistic hypotheses while complementing, not replacing, the comparative method. It highlights the value of Lexibank and Grambank datasets for automated language affiliation and outlines practical paths for applying this approach to isolates and ancient data.

Abstract

In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.

Paper Structure

This paper contains 20 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Classification results for all models, based on 100 runs with random seeds for the train/test split. Vertical lines indicate minimum, maximum, 25th, and 75th percentile, as well as the mean. The comparison is based on the balanced accuracy across language families to account for the difficulty of classifying small language families.
  • Figure 2: Results for the experiment on isolate affiliation. Results are limited to the first three families to which an isolate is affiliated, showing the proportion of the remaining families under the label Rest in the charts.
  • Figure 3: The left shows some of the original Cararí data published by Natterer. The right shows our standardization of the same entries using the EDICTOR tool (EDICTOR-3.1), with the original transcriptions given in the column 'Form'.
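
The caption of Figure 1 compares models by balanced accuracy across language families. As a minimal sketch of that metric (not the authors' evaluation code), balanced accuracy can be computed as the mean of the per-family recall, so a small family weighs as much as a large one. The function name and the toy labels below are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def balanced_accuracy(true_families, predicted_families):
    """Mean per-family recall: every family counts equally, whatever its size."""
    if len(true_families) != len(predicted_families):
        raise ValueError("label lists must have the same length")
    if not true_families:
        raise ValueError("label lists must not be empty")

    total = defaultdict(int)    # number of test languages per true family
    correct = defaultdict(int)  # correctly classified languages per true family
    for true, predicted in zip(true_families, predicted_families):
        total[true] += 1
        if true == predicted:
            correct[true] += 1

    recalls = [correct[family] / total[family] for family in total]
    return sum(recalls) / len(recalls)


# Toy data (hypothetical): one large family with 8 languages, one small with 2.
true_labels = ["LargeFamily"] * 8 + ["SmallFamily"] * 2
# The classifier gets every large-family language right but misses one of the
# two small-family languages.
predicted_labels = ["LargeFamily"] * 8 + ["LargeFamily", "SmallFamily"]

plain_accuracy = sum(
    t == p for t, p in zip(true_labels, predicted_labels)
) / len(true_labels)

print(plain_accuracy)                                    # 0.9
print(balanced_accuracy(true_labels, predicted_labels))  # 0.75
```

In this toy case plain accuracy is 0.9, while balanced accuracy drops to 0.75 because the error in the small family halves that family's recall. This is the sense in which the metric accounts for the difficulty of classifying small language families.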