Deciphering Assamese Vowel Harmony with Featural InfoWaveGAN

Sneha Ray Barman, Shakuntala Mahanta, Neeraj Kumar Sharma

TL;DR

The paper investigates learning Assamese vowel harmony directly from raw speech using Featural InfoWaveGAN (fiwGAN). It demonstrates that the model captures iterative long-distance, regressive harmony and even lexical learning, producing both harmonic and illicit outputs that reflect human-like acquisition patterns. Statistical analyses (linear mixed-effects and regression) provide evidence that a [+high,+ATR] vowel acts as a trigger, with the model exhibiting right-to-left harmony and meaningful generalization beyond the training data. The work highlights the potential and limits of unsupervised phonotactic learning from continuous speech and points to data augmentation and cross-language testing as avenues for future improvement.

Abstract

Traditional approaches for understanding phonological learning have predominantly relied on curated text data. Although insightful, such approaches limit the knowledge captured in textual representations of the spoken language. To overcome this limitation, we investigate the potential of the Featural InfoWaveGAN model to learn iterative long-distance vowel harmony using raw speech data. We focus on Assamese, a language known for its phonologically regressive and word-bound vowel harmony. We demonstrate that the model is adept at grasping the intricacies of Assamese phonotactics, particularly iterative long-distance harmony with regressive directionality. It also produced non-iterative illicit forms resembling speech errors during human language acquisition. Our statistical analysis reveals a preference for a specific [+high,+ATR] vowel as a trigger across novel items, indicative of feature learning. More data and control could improve model proficiency, contrasting the universality of learning.

Paper Structure

This paper contains 10 sections, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: An illustration of the fiwGAN architecture used in this work. We chose $N=82$ spoken utterances corresponding to lexical items in the Assamese language. The latent space contains 93 uniformly distributed latent variables, $z$, and 7 binary features ($\phi$) for $2^7=128$ lexical classes.
  • Figure 2: Spectrograms of three fiwGAN-generated audio files: (a) [p dobi], also an illicit item; (b) [debeku], also an innovative item following long-distance iterative harmony; and (c) [korisuw], a novel word with lexical meaning in the Assamese language.
  • Figure 3: F1 comparison of [podobi] (training data; shown in bars) and [p dobi] (generated data; shown in hatched bars). Here, o1 and o2 denote the first and second vowel, and i denotes the third vowel, in the input training data "podobi".
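
The latent layout described in the Figure 1 caption (7 binary features $\phi$ plus 93 uniformly distributed variables $z$, giving $2^7=128$ possible codes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the ordering of $\phi$ before $z$, and the uniform range $[-1, 1]$ are assumptions made for the example.

```python
import numpy as np

N_BINARY = 7    # binary featural code phi (from the Figure 1 caption)
N_UNIFORM = 93  # uniformly distributed latent variables z (from the caption)


def sample_latent(batch_size, rng):
    """Sample a batch of latent vectors laid out as [phi | z] (hypothetical helper)."""
    # Each of the 7 features is drawn independently as 0 or 1.
    phi = rng.integers(0, 2, size=(batch_size, N_BINARY)).astype(np.float32)
    # Assumption: z is uniform on [-1, 1]; the caption only says "uniformly distributed".
    z = rng.uniform(-1.0, 1.0, size=(batch_size, N_UNIFORM)).astype(np.float32)
    return np.concatenate([phi, z], axis=1), phi


def code_to_class(phi):
    """Read each 7-bit code as a binary number, giving a class index in [0, 127]."""
    weights = 2 ** np.arange(N_BINARY - 1, -1, -1)  # 64, 32, ..., 1
    return phi.astype(np.int64) @ weights


rng = np.random.default_rng(0)
latent, phi = sample_latent(4, rng)
classes = code_to_class(phi)
```

Here `latent` has 100 columns per sample, and `classes` maps each sampled code to one of the 128 possible lexical classes.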