Neural Proto-Language Reconstruction
Chenxuan Cui, Ying Chen, Qinxin Wang, David R. Mortensen
TL;DR
The paper addresses automated proto-language reconstruction, focusing on Middle Chinese with the WikiHan dataset, and proposes three neural strategies to improve reconstruction: data augmentation to recover missing reflexes, a VAE-augmented Transformer to enforce forward-backward consistency, and a variational NMT (VNMT) framework conditioned on daughter languages. Data augmentation uses reflex prediction (CNN) and character-level transduction to enrich the training data. The VAE-Transformer integrates a forward reconstruction path to encourage a regular latent space, and the VNMT approach grounds proto-form generation in a conditional latent space. Empirical results on WikiHan show that data augmentation stabilizes training and that the VAE-Transformer improves over a baseline Transformer, with significant gains across multiple runs, though it does not consistently beat the best RNN model. VNMT variants also yield competitive performance. Overall, the work demonstrates that incorporating linguistic regularities and data augmentation into neural models can meaningfully improve proto-language reconstruction, and it points to promising directions: integrating VAE techniques with strong baselines and leveraging all daughter forms.
Abstract
Proto-form reconstruction has been a painstaking process for linguists. Recently, computational models such as RNNs and Transformers have been proposed to automate this process. We take three different approaches to improve upon previous methods: data augmentation to recover missing reflexes, adding a VAE structure to the Transformer model for proto-to-language prediction, and using a neural machine translation model for the reconstruction task. We find that with the additional VAE structure, the Transformer model performs better on the WikiHan dataset, and that the data augmentation step stabilizes training.
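The data augmentation step can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a cognate set is stored as a mapping from daughter variety to reflex, with None marking a missing reflex, and that some reflex predictor maps a proto-form and a variety to a predicted reflex. The names fill_missing_reflexes and toy_predictor, the variety labels, and the forms are all hypothetical; the toy predictor simply copies the proto-form and stands in for a trained model.

```python
from typing import Callable, Dict, Optional

# Assumed data layout: daughter variety -> reflex, None when the reflex is missing.
CognateSet = Dict[str, Optional[str]]


def fill_missing_reflexes(
    cognate_set: CognateSet,
    proto_form: str,
    predict_reflex: Callable[[str, str], str],
) -> Dict[str, str]:
    """Return a copy of the cognate set with missing reflexes predicted."""
    filled: Dict[str, str] = {}
    for variety, reflex in cognate_set.items():
        if reflex is None:
            # Missing entry: ask the predictor for a reflex of the proto-form.
            filled[variety] = predict_reflex(proto_form, variety)
        else:
            # Attested entry: keep it unchanged.
            filled[variety] = reflex
    return filled


def toy_predictor(proto_form: str, variety: str) -> str:
    # Hypothetical stand-in for a trained reflex-prediction model:
    # it ignores the variety and copies the proto-form.
    return proto_form


# Made-up forms and variety labels, for illustration only.
example = {"variety_a": "kam", "variety_b": None, "variety_c": "kan"}
augmented = fill_missing_reflexes(example, "kam", toy_predictor)
```

Under these assumptions, the augmented cognate set has no gaps and can be fed to a reconstruction model that expects a reflex for every daughter variety, while the attested reflexes are left untouched.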
