A symbolic Perl algorithm for the unification of Nahuatl word spellings
Juan-José Guzmán-Landa, Jesús Vázquez-Osorio, Juan-Manuel Torres-Moreno, Ligia Quintana Torres, Miguel Figueroa-Saavedra, Martha-Lorena Avendaño-Garrido, Graham Ranger, Patricia Velázquez-Morales, Gerardo Eugenio Sierra Martínez
TL;DR
The paper tackles the NLP challenges of Nahuatl by addressing polyorthography with a symbolic unification strategy that encodes linguistic normalization rules as regular-expression patterns. Using the $π$-yalli corpus, it evaluates the unification’s impact on semantic similarity tasks at both word and sentence levels, comparing large language models and static embeddings. Results show that unigraphy generally improves performance, with static models approaching or surpassing some LLMs in certain settings and tasks, and the best gains observed with enhanced preprocessing. The work demonstrates the practical value of standardized orthography for low-resource languages and outlines future directions to extend rules and hybridize symbolic and neural approaches, with code available on GitHub.
Abstract
In this paper, we describe a symbolic model for the automatic orthographic unification of Nawatl text documents. Our model builds on algorithms that we previously used to analyze sentences in Nawatl, and on the $π$-yalli corpus, which consists of texts written in several Nawatl orthographies. Our automatic unification algorithm implements linguistic rules as symbolic regular expressions. We also present a manual evaluation protocol that we have proposed and implemented to assess the quality of the unified sentences generated by our algorithm, by testing them on a sentence-level semantic task. We have obtained encouraging results from the evaluators for most of the desired features of our artificially unified sentences.
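
To make the idea of encoding normalization rules as regular expressions concrete, the following is a minimal sketch written in Python rather than the Perl used by the authors. The rule list, the target graphemes, and the function name `unify` are illustrative assumptions and are not taken from the paper's rule set; they only show how an ordered list of substitution patterns can map several spellings of a word to one form.

```python
import re

# Hypothetical, illustrative rules (NOT the paper's actual rule set).
# Each pair maps a spelling variant to a single target grapheme.
# Rules are applied in order, so more specific contexts come first.
RULES = [
    (re.compile(r"qu(?=[ei])"), "k"),      # "qu" before e/i  -> "k"
    (re.compile(r"hu(?=[aeio])"), "w"),    # "hu" before a vowel -> "w"
    (re.compile(r"uh(?![aeiou])"), "w"),   # "uh" not before a vowel -> "w"
    (re.compile(r"c(?=[ei])"), "s"),       # "c" before e/i -> "s"
    (re.compile(r"c(?!h)"), "k"),          # any other "c" outside "ch" -> "k"
    (re.compile(r"z"), "s"),               # "z" -> "s" (so "tz" becomes "ts")
]


def unify(text: str) -> str:
    """Lowercase the text and apply every substitution rule in order."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    for word in ["Nahuatl", "cihuatl", "calli", "quetzalli", "chichi"]:
        print(word, "->", unify(word))
```

Under these assumed rules, `unify("Nahuatl")` and `unify("nawatl")` both yield `nawatl`, which is the kind of convergence an orthographic unification step aims for. Rule order matters in such a design: the context-sensitive patterns must run before the general ones, otherwise the fallback rule for `c` would rewrite letters that a more specific rule should have handled.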
