Neural Proto-Language Reconstruction

Chenxuan Cui, Ying Chen, Qinxin Wang, David R. Mortensen

TL;DR

The paper addresses automated proto-language reconstruction, focusing on Middle Chinese and the WikiHan dataset, and proposes three neural strategies to improve it: data augmentation to recover missing reflexes, a VAE-augmented Transformer to enforce forward-backward consistency, and a variational NMT framework conditioned on daughter languages. Data augmentation uses reflex prediction (CNN) and character-level transduction to enrich the training data. The VAE-Transformer integrates a forward reconstruction path to encourage a regular latent space, while the Variational-NMT approach grounds proto-form generation in a conditional latent space. Empirical results on WikiHan show that data augmentation stabilizes training and that the VAE-Transformer improves over a baseline Transformer (with significant gains across multiple runs), though it does not consistently beat the best RNN model; the VNMT variants also yield competitive performance. Overall, the work demonstrates that incorporating linguistic regularities and data augmentation into neural models can meaningfully improve proto-language reconstruction, and it points to promising directions: integrating VAE techniques with strong baselines and leveraging all daughter forms.
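As a rough illustration of how a VAE term can encourage a regular latent space, the sketch below computes the standard closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior, and adds it to a reconstruction loss. This is the textbook VAE objective, not necessarily the exact loss used in the paper; the function names, the `kl_weight` parameter, and the toy numbers are assumptions made for this example.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

def vae_loss(recon_nll, mu, log_var, kl_weight=1.0):
    """Reconstruction loss plus a weighted KL regularizer (hypothetical weighting)."""
    return recon_nll + kl_weight * kl_to_standard_normal(mu, log_var)

# Toy values: a posterior that already matches the prior adds no penalty.
no_penalty = vae_loss(recon_nll=2.0, mu=[0.0, 0.0], log_var=[0.0, 0.0])
# A posterior whose mean drifts away from zero is penalized.
penalized = vae_loss(recon_nll=2.0, mu=[1.0], log_var=[0.0], kl_weight=0.5)
```

The KL term pulls each encoded form toward the prior, which is one common way to keep nearby latent codes decoding to similar outputs.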

Abstract

Proto-form reconstruction has been a painstaking process for linguists. Recently, computational models such as RNNs and Transformers have been proposed to automate this process. We take three different approaches to improve upon previous methods: data augmentation to recover missing reflexes, adding a VAE structure to the Transformer model for proto-to-language prediction, and using a neural machine translation model for the reconstruction task. We find that with the additional VAE structure, the Transformer model performs better on the WikiHan dataset, and that the data augmentation step stabilizes training.

Paper Structure

This paper contains 26 sections, 1 equation, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Stacked input for the CNN reflex prediction model.
  • Figure 2: Example of position and type embeddings for feature-guided character-level transducer.
  • Figure 3: Model diagram of the proposed VAE-Transformer model.
  • Figure 4: Model diagram of the proposed Variational-NMT model.
  • Figure 5: Attention weights on the different daughter languages. Weights are unnormalized (left) and normalized by the number of data points (right).
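
The Figure 5 caption distinguishes raw attention weights from weights normalized by the number of data points. One plausible reading, sketched below, is that the attention mass accumulated for each daughter language is divided by the number of examples in which that language has a form, so that well-attested languages do not dominate simply by appearing more often. The function name, the language labels, and all numbers are illustrative assumptions, not values from the paper.

```python
def normalize_attention(total_weight, num_points):
    """Divide each language's accumulated attention by its data-point count.

    Languages with no data points are skipped to avoid division by zero.
    """
    return {
        lang: total_weight[lang] / num_points[lang]
        for lang in total_weight
        if num_points.get(lang, 0) > 0
    }

# Hypothetical accumulated attention and attestation counts.
total_weight = {"lang_a": 12.0, "lang_b": 5.0, "lang_c": 1.0}
num_points = {"lang_a": 4, "lang_b": 2, "lang_c": 0}

normalized = normalize_attention(total_weight, num_points)
```

In this toy case `lang_a` has more total attention than `lang_b`, but the per-example gap is much smaller once counts are taken into account.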