Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters

Takashi Morita, Timothy J. O'Donnell

TL;DR

This work investigates whether the Germanic-Latinate distinction in English can be learned from phonotactics alone, without access to word origins. It introduces a Bayesian unsupervised clustering framework that combines a Dirichlet-process prior with a trigram phoneme model to group English words from the CELEX dataset into latent sublexica, and it evaluates the resulting clusters against etymology where etymological information is available. The results reveal two primary clusters that largely align with Germanic and Latinate origins. These clusters recover established phonotactic generalizations (e.g., stress patterns), uncover new patterns such as the distinctive Germanic cues [ip] and [hu], and predict the grammaticality of the double-object construction (DOC) more accurately than the true etymology does. The study provides cross-linguistic support for phonotactics-based sublexicon learning, offers data-driven hypotheses for experimental validation, and discusses limitations and extensions (prosody, long-distance dependencies, online learning) with implications for historical linguistics and cognitive modeling.

Abstract

Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure, double-object datives, is predominantly associated with Germanic verbs rather than Latinate verbs. As a cognitive model, however, such etymology-based generalizations face challenges in terms of learnability, since the historical origins of words are presumably inaccessible information for general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed an unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our findings uncovered previously unrecognized features of the quasi-etymological clusters.
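
To make the clustering idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes add-constant smoothing for the trigram model and scores a single word against existing clusters with a Chinese-restaurant-process prior (one common representation of a Dirichlet process). The names `Cluster`, `assignment_posterior`, `smoothing`, and `concentration`, as well as the toy phoneme strings, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

BOUNDARY = "#"  # hypothetical word-boundary symbol


def trigrams(phonemes):
    """Pad a phoneme sequence with boundaries and return its trigrams."""
    padded = [BOUNDARY, BOUNDARY] + list(phonemes) + [BOUNDARY]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


class Cluster:
    """Trigram counts for one latent sublexicon (toy version)."""

    def __init__(self):
        self.size = 0         # number of words assigned to this cluster
        self.tri = Counter()  # counts of (a, b, c)
        self.ctx = Counter()  # counts of the context (a, b)

    def add(self, phonemes):
        self.size += 1
        for a, b, c in trigrams(phonemes):
            self.tri[(a, b, c)] += 1
            self.ctx[(a, b)] += 1

    def log_likelihood(self, phonemes, vocab_size, smoothing=0.1):
        """Add-constant smoothed trigram log-probability (an assumption)."""
        total = 0.0
        for a, b, c in trigrams(phonemes):
            num = self.tri[(a, b, c)] + smoothing
            den = self.ctx[(a, b)] + smoothing * vocab_size
            total += math.log(num / den)
        return total


def assignment_posterior(phonemes, clusters, vocab_size, concentration=1.0):
    """Posterior over the existing (non-empty) clusters plus one new cluster.

    Existing cluster k has prior weight proportional to its size; a new
    cluster has prior weight proportional to `concentration`.
    """
    scores = [
        math.log(c.size) + c.log_likelihood(phonemes, vocab_size)
        for c in clusters
    ]
    scores.append(
        math.log(concentration) + Cluster().log_likelihood(phonemes, vocab_size)
    )
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


# Toy usage with made-up phoneme strings (not CELEX data).
cluster_a, cluster_b = Cluster(), Cluster()
for word in (["k", "i", "p"], ["h", "i", "p"], ["h", "i", "t"]):
    cluster_a.add(word)
for word in (["s", "e", "n", "t"], ["t", "e", "n", "t"]):
    cluster_b.add(word)

vocab = {"k", "i", "p", "h", "t", "s", "e", "n", BOUNDARY}
posterior = assignment_posterior(["k", "i", "t"], [cluster_a, cluster_b], len(vocab))
```

In this sketch, `posterior` holds one probability per existing cluster followed by the probability of opening a new cluster. A full sampler or variational procedure would iterate such assignments over the whole lexicon, which the sketch does not attempt.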

Paper Structure

This paper contains 25 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Alignment between the model-discovered clusters (columns, MAP classification) and the etymological origin according to Wikipedia (rows).
  • Figure 2: The proportion of the MAP cluster assignments given to the bases of the top thirty type-frequent suffixes.
  • Figure 3: Accuracy scores of DOC grammatical prediction by the phonotactics-based clustering and the ground-truth etymology.
  • Figure 4: Alignment between the model-discovered clusters (columns, MAP classification) and the DOC grammaticality patterns (rows).
  • Figure 5: Cluster-assignment probabilities of ✓DOC-$\overline{\textsc{Lat}}$ dative verbs.
  • ...and 1 more figure